Abstract

State-of-the-art classification algorithms are designed to minimize the misclassification
error of the system, which is a linear function of the per-class false negatives and false
positives. Nonetheless, non-linear performance measures are widely used for the evaluation
of learning algorithms. For example, the F-measure is a commonly used non-linear
performance measure in classification problems. We study the theoretical properties of a subset
of non-linear performance measures called pseudo-linear performance measures, which
includes the F-measure and the Jaccard index, among many others. We establish that many notions of
F-measures and Jaccard index are pseudo-linear functions of the per-class false negatives
and false positives for binary, multiclass and multilabel classification. Based on this
observation, we present a general reduction of such performance measure optimization problems
to cost-sensitive classification problems with unknown costs. We then propose an algorithm
with provable guarantees to obtain an approximately optimal classifier for the F-measure
by solving a series of cost-sensitive classification problems. The strength of our analysis
is that it is valid on any dataset and any class of classifiers, extending the existing
theoretical results on the binary F-score, which are asymptotic in nature. Our analysis shows that
thresholding cost-insensitive scores, a common technique employed to optimize the F-measure,
yields sub-optimal results. We also establish the multi-objective nature of the F-measure
maximization problem by linking the algorithm with the weighted-sum approach used in
multi-objective optimization. We present numerical experiments to illustrate the relative
importance of cost asymmetry and thresholding when learning linear classifiers on various
F-measure optimization tasks.

Keywords: machine learning, cost-sensitive classification, pseudo-linear performance
measures, F-score, Jaccard index

1. Introduction 

Different performance measures exist to assess the efficiency of learning algorithms.
Misclassification rate is the most commonly used performance measure in classification systems.
Like many other measures that we investigate in this paper, it is defined over the set
of classification outcomes. The four possible outcomes of a classifier are True Positive (tp),
True Negative (tn), False Negative (fn) and False Positive (fp). Misclassification rate is a
linear function of these outcomes, defined as the sum of fp and fn. Conceptually,
classification algorithms solve optimization problems in which we optimize a loss function corresponding
to the performance measure (see ??). For example, the loss function that corresponds to
misclassification rate is the 0-1 loss.

As mentioned, misclassification rate is a commonly used performance measure, albeit
unsuitable for specific classes of problems. For example, consider the binary classification
of an imbalanced dataset of size 100, with 95 samples of one class (say negative)
and 5 of the other class (say positive). A trivial classifier of the form 'always
predict negative' achieves high accuracy but is useless. In this specific example,
$F_\beta$ (?) can be considered a more meaningful performance measure than misclassification
rate. In general, performance measures like $F_\beta$ are extensively used in practical problems
(??). One of the striking characteristics of these performance measures is their non-linearity
with respect to the per-class false negatives and false positives, whereas misclassification
rate is a linear function of false negatives and false positives. Moreover, no convex
surrogate loss function exists for such non-linear measures; specifically, no
surrogate loss function exists for the F-measure. Another interesting property specific to
the F-measure and the Jaccard index is that they are sample-level measures and do not decompose over
individual examples. These three aspects make the optimization problem a difficult and
interesting one.

In the current paper, we study the theoretical and algorithmic aspects pertaining to the
optimization of a set of non-linear performance measures called pseudo-linear performance
measures. The commonly used performance measure $F_1$ is an example of a pseudo-linear
performance measure. Less commonly used measures like the Jaccard index also fall into
this category, among many others. Here, we focus primarily on pseudo-linear notions of the
F-measure. We consider the setting in which a dataset, given as a set of feature vectors, is to
be classified such that the F-measure (restricted to pseudo-linear functions) of the resulting
classification is (approximately) optimal. In the literature, F-measures are also often called
F-scores. Here we will stick to the first terminology, which refers to the measurement of
performance, in order to avoid any confusion with classification scores, that is, the
real-valued scores that may be provided by classifiers and that are thresholded to produce
decisions. Unless otherwise explicitly stated, all the discussion in this paper refers to
F-measure optimization. At a later point, we generalize the results to other pseudo-linear
measures.

Our principal goal is to study algorithms for the empirical optimality of pseudo-linear
F-measures. Given a training set, our analysis proves that the optimal F classifier for
pseudo-linear F-measures can be found by minimizing the total misclassification cost of a
cost-sensitive classification problem for each value of the cost in an inner loop, and selecting the best among the
set of costs. Optimality results for state-of-the-art algorithms for pseudo-linear F-measures are
asymptotic, whereas our results also hold in the non-asymptotic regime. Furthermore,
our analysis can be linked to the weighted-sum approach used in multi-objective
optimization. Additionally, in the case of binary $F_\beta$ and multilabel macro-$F_\beta$, our experimental
results suggest that selecting a classifier based on minimizing the total misclassification cost
is the same as selecting the optimal F-measure a posteriori. Our experiments also reveal the
importance of thresholding classification scores to optimize F-measures.

This article is an extended version of an already published conference paper (?). The
article is organized as follows. Section [2] introduces basic definitions and notations used
throughout the paper. It also presents earlier work on F-measure optimization. Section [3]




F-measure Optimization 


presents the theoretical analysis, where we establish the pseudo-linearity of different
practical F-measures, and prove that the optimal F classifier can be found by minimizing the
total misclassification cost of a cost-sensitive classification problem for a specific cost value. We
derive the values of the cost vector for many pseudo-linear F-measures. We establish the
multi-objective view of the F-measure optimization problem and link our cost-minimization
approach to the popular weighted-sum approach for solving multi-objective optimization
problems. Section [5] presents the experimental results. We study the importance of
thresholding for finding optimal solutions. We conclude the paper in Section [6]. The proofs of all
the propositions stated in Section [3] are deferred to Appendix [A].


2. Background and Related Work 

Here we give a brief review of the state-of-the-art methods for F-measure maximization. We
start by introducing the notation used throughout the paper; we also give the definitions
of some basic quantities like the $F_\beta$-measure.

2.1 Notation and Basic Definitions 

We are given (i) a measurable space $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is the feature space and $\mathcal{Y}$ is the (finite)
prediction set, (ii) a probability measure $\mu$ over $\mathcal{X} \times \mathcal{Y}$, and (iii) a set of (measurable)
classifiers $\mathcal{H}$ from the feature space $\mathcal{X}$ to $\mathcal{Y}$. We distinguish here the prediction set $\mathcal{Y}$
from the label space $\mathcal{L} = \{1, \dots, L\}$: in binary or single-label multiclass classification, the
prediction set $\mathcal{Y}$ is the label set $\mathcal{L}$, but in multilabel classification, $\mathcal{Y} = 2^{\mathcal{L}}$ is the powerset
of the set of possible labels. In that framework, we assume that we have an i.i.d. sample
drawn from an underlying data distribution $P$ on $\mathcal{X} \times \mathcal{Y}$. The empirical distribution of this
finite training (or test) sample will be denoted by $\hat{P}$. Then, we may take $\mu = P$
to get results at the population level (concerning expected errors), or we may take $\mu = \hat{P}$
to get results on a finite sample. Likewise, the set of classifiers $\mathcal{H}$ can be a restricted set
of functions such as linear classifiers if $\mathcal{X}$ is a finite-dimensional vector space, or may be
the set of all measurable classifiers from $\mathcal{X}$ to $\mathcal{Y}$ to get results in terms of Bayes-optimal
classifiers. Finally, when required, we will use bold characters for vectors and normal font
with subscript for indexing.

Most of the previous work on pseudo-linear metrics is centered around the $F_\beta$-measure in
binary settings. The $F_\beta$-measure is defined as the weighted harmonic mean of precision and
recall. Precision is defined as the fraction of predicted positive instances that are indeed
positive, and recall is defined as the fraction of positive instances that are correctly
predicted as positive. Formally, we can define these metrics using classifier outcomes. Given
a binary dataset and classifier, tp corresponds to the correct prediction of a positive label,
tn corresponds to the correct prediction of a negative label, fn corresponds to the incorrect
prediction of a positive label as a negative label, and fp corresponds to the incorrect
prediction of a negative label as positive. In general, these outcomes are depicted using a
confusion matrix, also called a contingency table (see Table [2]). In terms of the classification
outcomes (tp, tn, fn, fp), we formally define precision, recall and $F_\beta$ associated with a binary
classifier $h \in \mathcal{H}$ for a given sample $(x, y) \in (\mathcal{X} \times \mathcal{Y})^n$ as:




PUTHIYA PARAMBATH ET AL. 


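$$\text{(precision)} \qquad \mathrm{Precision}(h(x), y) = \frac{\sum_{i=1}^{n} tp(h(x_i))}{\sum_{i=1}^{n} \left[ tp(h(x_i)) + fp(h(x_i)) \right]}$$

$$\text{(recall)} \qquad \mathrm{Recall}(h(x), y) = \frac{\sum_{i=1}^{n} tp(h(x_i))}{\sum_{i=1}^{n} \left[ tp(h(x_i)) + fn(h(x_i)) \right]}$$

$$\text{(binary } F_\beta \text{)} \qquad F_\beta(h(x), y) = \frac{(1 + \beta^2) \sum_{i=1}^{n} tp(h(x_i))}{\sum_{i=1}^{n} \left[ (1 + \beta^2)\, tp(h(x_i)) + \beta^2 fn(h(x_i)) + fp(h(x_i)) \right]}$$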


In the above, the dependence of the classification outcomes on the label vector $y$ is omitted for
convenience. The parameter $\beta$ weights precision and recall in $F_\beta$: $F_0$ corresponds to precision,
$F_\infty$ corresponds to recall, and $F_1$, the most widely used, corresponds to equal weights. In
the example mentioned in the introduction, classifying a sample of 100 instances,
the trivial classifier gives precision, recall and $F_1$ values of 0. Precision does not consider
false negatives, and recall does not consider false positives. So in practical problems, $F_\beta$
is preferred. One thing to note: unlike misclassification rate, the F-measure is not invariant
under label switching, i.e. if we change the positive label to negative, we get a different
F-measure. Hence it is used in problems where correct classification of the minority label is
of vital importance. In multilabel and multiclass settings, three different definitions of the
F-measure can be found, namely instance-wise, macro and micro F-measures. We will give
formal definitions of these in Section [3] in connection with our theoretical framework.
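As a small illustration of the definitions above, the following Python sketch computes precision, recall and $F_\beta$ from aggregate counts and evaluates the imbalanced example from the introduction. The function names, and the convention of returning 0 when a denominator vanishes, are assumptions made for this sketch rather than part of the original definitions.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of predicted positives that are truly positive (0.0 if none predicted)."""
    return tp / (tp + fp) if tp + fp > 0 else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of true positives that are predicted positive (0.0 if no positives)."""
    return tp / (tp + fn) if tp + fn > 0 else 0.0


def f_beta(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F_beta from aggregate counts, following the binary F_beta formula."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom > 0 else 0.0


# Imbalanced example: 95 negatives, 5 positives, classifier "always predict negative".
tp, fp, fn, tn = 0, 0, 5, 95
accuracy = (tp + tn) / (tp + fp + fn + tn)
trivial_f1 = f_beta(tp, fp, fn, beta=1.0)
print(accuracy, precision(tp, fp), recall(tp, fn), trivial_f1)
```

The trivial classifier reaches an accuracy of 0.95 while its $F_1$ is 0, which is the contrast the example in the introduction is meant to convey.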


2.2 Related Work 

F-measure optimization was studied only on a limited basis in the past (?????). The last
couple of years have witnessed an increasing interest in this domain (?????????). The majority of
the work was confined to F-measure maximization in binary classification settings, whereas
very little work was done on multilabel and multiclass F-measure maximization tasks (??).
? suggested an algorithm for finding a locally maximal $F_1$-measure for binary classification
problems by approximating the classification outcomes using logistic models. Since the
objective function used is non-convex, the algorithm does not guarantee optimality. This
issue is addressed by running the procedure multiple times and selecting the best result.
The orthogonal problem of inferring the hypothesis with optimal $F_1$ from a probabilistic
model is discussed by (?). In the scientific literature, the two problem formulations have been
referred to as empirical utility maximization (EUM) and the decision-theoretic approach (DTA),
respectively (?).

The two formulations differ with respect to the definition of the expected F-measure.
In the EUM-based approach, the population F-measure is defined as the F-measure of
the expected tp, fp and fn. Formally, in EUM, the expected F-measure is defined as

$$F_\beta^{EUM}(h) = \frac{(1 + \beta^2)\, \mathbb{E}[tp(h(x))]}{(1 + \beta^2)\, \mathbb{E}[tp(h(x))] + \beta^2\, \mathbb{E}[fn(h(x))] + \mathbb{E}[fp(h(x))]}$$

An optimal EUM classifier can be defined as

$$h^* = \operatorname*{argmax}_{h \in \mathcal{H}} F_\beta^{EUM}(h)$$

In DTA, assuming a probability distribution $p(Y)$ on $\{0, 1\}^n$, the expected F-measure is
formally defined as

$$F_\beta^{DTA}(h) = \mathbb{E}_{y \sim p(Y)} \left[ F_\beta(h(x), y) \right]$$

An optimal DTA classifier is of the form

$$h^* = \operatorname*{argmax}_{h \in \mathcal{H}} \sum_{y \in \{0, 1\}^n} F_\beta(h(x), y)\, p(y)$$

From an algorithmic point of view, DTA-based algorithms are computationally more
expensive than EUM algorithms. DTA-based algorithms require an efficient method to estimate
the joint probability and iterate over exponentially many combinations of $h$ and $y$, and the
problem of estimating exact probabilities is as hard as the original problem. But assuming
i.i.d. samples and considering the functional properties of the F-measure (it is a function of
the integer counts (tp, fp, fn)), the above problem can be solved more efficiently. The algorithm
given by ? runs in $O(n^4)$, where $n$ is the number of examples. ? improved the efficiency of
this algorithm, leading to a complexity of $O(n^3)$, using dynamic programming.
They also remark that the optimal classifier for binary $F_1$ is of the form $\mathrm{sign}(p(y = 1 \mid x) - \delta^*)$,
where $\delta^*$ is a threshold score dependent on the underlying distribution. ? extended the
algorithm given by ? to a dependence assumption and gave a method to calculate the optimal
F classifier with $O(n^3)$ time complexity, given $n^2 + 1$ parameters of the joint distribution
$p(y)$. This algorithm was used in a multilabel setting for the instance-wise F-measure (see
Remark [3]). In addition to the high computational footprint, there is no optimality guarantee
on finite samples. In general, optimality results for DTA algorithms are asymptotic in nature (?).

On the other hand, EUM-based approaches are computationally less demanding, and are
based on the structural risk minimization (SRM) principle. Here we minimize an approximate
surrogate loss function, and select the hypothesis with minimal error on the validation set.
The most commonly employed EUM approach is to threshold the scores obtained using
linear classifiers like logistic regression or support vector machines (SVM) such that $F_1$ is
maximized. An approximate surrogate function based approach named SVM$^{perf}$ is given by
?, based on the observation that $F_1$ is a sample-level measure. In the suggested method, the
discriminant function is defined over the linear combination of the feature vectors, where
the scalar multiplier is the label associated with each feature vector in the training sample.
Even though the reported experimental results were promising, the method does not offer
any theoretical optimality guarantee. Moreover, our experiments establish that SVM$^{perf}$
is a sub-optimal method. ? also advocated SVMs with asymmetric costs (that is,
with different costs for false negatives and false positives) for $F_1$-measure optimization in
binary classification. However, their argument, specific to SVMs, is not methodological but
technical (relaxation of the maximization problem).

In the case of multilabel classification, ? argued that the multilabel micro-F-measure can
be optimized by thresholding the class confidence scores, one label at a time. ? used
$k$-nearest neighbours and SVM to generate scores. In general, thresholding cost-insensitive
SVM scores does not guarantee empirical optimality, and the paper does not address the
issue of hyperparameter selection for the backend algorithm ($k$ for $k$-nearest neighbours and
the regularization coefficient for SVM).

? tackle the problem by combining different classification models. They combined two
logistic models, (i) maximum likelihood logistic regression and (ii) approximate logistic
approximation (see ?), to maximize multilabel micro, macro and instance-wise F-measures.
This line of work falls under multiple classifier systems. Multiple classifier systems are not
widely used for F-measure maximization, and are still at a nascent stage. To our knowledge,
no proper statistical study regarding the optimality of multiple classifier systems for
F-measure maximization has been done so far.

Apart from the F-measure, some of the most recent work discusses non-linear performance
measures like the Jaccard index (???). Following in the footsteps of ?, ?? proposed algorithms
to maximize linear-fractional performance measures by thresholding the class
confidence scores. But as mentioned earlier, these results hold only asymptotically.

In this work, we aim to perform empirical risk minimization-type learning, that is,
to find a classifier with the highest population-level F-measure by maximizing its empirical
counterpart. In that sense, we follow the EUM framework. Nonetheless, regardless of how
we define the generalization performance, our results can be used to maximize the empirical
value of the $F_\beta$-measure.

3. Theoretical Framework and Analysis 

In this section, we present the theoretical framework which is at the heart of this work.
Our results are mainly motivated by the maximization of F-measures for binary,
multiclass, and multilabel classification. They rely on a general property of these performance
measures, namely their pseudo-linearity with respect to the false negative and false positive
probabilities.

For binary classification, we prove that, in order to optimize the F-measure, it is
sufficient to solve a binary classification problem with different costs allocated to false positive
and false negative errors (Proposition [4]). However, these costs are not known a priori, so in
practice we propose to learn several classifiers with different costs, and to select the best one
according to the F-measure in a second step. Propositions [5] and [6] provide approximation
guarantees on the F-measure we can obtain by following this principle, depending on the
granularity of the search in the cost interval.

We first establish the results for the $F_\beta$-measures in binary classification, and then extend
them to other cases of F-measures with similar functional forms that are used in multiclass and
multilabel classification. We also briefly describe pseudo-linear notions of the Jaccard index,
which can also be handled by our framework. For that reason, we present the results and
proofs for the binary case, followed by multiclass and multilabel F-measures.

3.1 Error Profiles and Pseudo-Linearity 

3.1.1 Error Profiles 

The performance of a classifier $h$ on distribution $\mu$ can be summarized by the elements
of the contingency table (see Table [2]), which contains the summary of errors. For all
classification tasks (binary, multiclass and multilabel), the F-measures we consider here are
functions of the off-diagonal elements of the contingency table, which themselves are defined in
terms of the marginal probabilities of classes and the per-class false negative/false positive
probabilities. The marginal probability of label $k$ will be denoted by $P_k$, and the per-class
false negative/false positive probabilities of a classifier $h$ are denoted by $FN_k(h)$ and $FP_k(h)$.
Their definitions are given below:

$$\text{(binary/multiclass)} \quad P_k = \mu(\{(x, y) \mid y = k\}), \quad FN_k(h) = \mu(\{(x, y) \mid y = k \text{ and } h(x) \neq k\}),$$
$$FP_k(h) = \mu(\{(x, y) \mid y \neq k \text{ and } h(x) = k\}).$$

$$\text{(multilabel)} \quad P_k = \mu(\{(x, y) \mid k \in y\}), \quad FN_k(h) = \mu(\{(x, y) \mid k \in y \text{ and } k \notin h(x)\}),$$
$$FP_k(h) = \mu(\{(x, y) \mid k \notin y \text{ and } k \in h(x)\}).$$

These probabilities of a classifier $h$ are then summarized by the error profile $E(h)$:

$$E(h) = \left( FN_1(h), FP_1(h), \dots, FN_L(h), FP_L(h) \right) \in \mathbb{R}^{2L}.$$
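To make the error profile concrete, here is a minimal Python sketch that computes the empirical marginals and the empirical error profile for the binary/multiclass case, taking $\mu$ to be the empirical distribution of a finite sample. The function names and the toy label vectors are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def marginals(y_true: Sequence[int], num_labels: int) -> List[float]:
    """Empirical marginal probabilities (P_1, ..., P_L) for labels 1..L."""
    n = len(y_true)
    return [sum(1 for y in y_true if y == k) / n for k in range(1, num_labels + 1)]


def error_profile(y_true: Sequence[int], y_pred: Sequence[int], num_labels: int) -> List[float]:
    """Empirical error profile (FN_1, FP_1, ..., FN_L, FP_L) for labels 1..L."""
    n = len(y_true)
    profile: List[float] = []
    for k in range(1, num_labels + 1):
        # FN_k: true label is k but the prediction is not k.
        fn_k = sum(1 for y, p in zip(y_true, y_pred) if y == k and p != k) / n
        # FP_k: true label is not k but the prediction is k.
        fp_k = sum(1 for y, p in zip(y_true, y_pred) if y != k and p == k) / n
        profile.extend([fn_k, fp_k])
    return profile


# Toy binary sample with labels in {1, 2} (hypothetical data).
y_true = [1, 1, 2, 2, 2, 1, 2, 2]
y_pred = [1, 2, 2, 2, 1, 1, 2, 2]
p = marginals(y_true, num_labels=2)
e = error_profile(y_true, y_pred, num_labels=2)
print(p, e)
```

On this sample the false negatives of class 2 are exactly the false positives of class 1, which is the identity $FN_2 = FP_1$ used for the binary case.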

3.1.2 Pseudo-Linear Functions 

Throughout the paper, we rely on the notion of pseudo-linearity of a function, which is
itself defined from the notion of pseudo-convexity (see ?, Definition 3.2.1): a differentiable
function $F : \mathcal{D} \subset \mathbb{R}^d \to \mathbb{R}$, defined on a convex open subset of $\mathbb{R}^d$, is pseudo-convex if

$$\forall e, e' \in \mathcal{D}, \quad F(e) > F(e') \;\Rightarrow\; \langle \nabla F(e), e' - e \rangle < 0,$$

where $\langle \cdot, \cdot \rangle$ is the canonical dot product on $\mathbb{R}^d$.

Moreover, $F$ is pseudo-linear if both $F$ and $-F$ are pseudo-convex. In practice, working
with gradients of non-linear functions may be cumbersome, so we will use the following
characterization, which is a rephrasing of ?, Theorem 3.3.9, basically stating that level sets
of pseudo-linear functions are hyperplanes:

Theorem 1 (?) A non-constant function $F : \mathcal{D} \to \mathbb{R}$, defined and differentiable on the
open convex set $\mathcal{D} \subseteq \mathbb{R}^d$, is pseudo-linear on $\mathcal{D}$ if and only if $\forall e \in \mathcal{D}$, $\nabla F(e) \neq 0$, and:
$\exists a : \mathbb{R} \to \mathbb{R}^d$ and $\exists b : \mathbb{R} \to \mathbb{R}$ such that, for any $t$ in the image of $F$:

$$F(e) \geq t \;\Leftrightarrow\; \langle a(t), e \rangle + b(t) \leq 0 \quad \text{and} \quad F(e) \leq t \;\Leftrightarrow\; \langle a(t), e \rangle + b(t) \geq 0.$$

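Pseudo-linearity is the main property of linear-fractional functions (ratios of linear
functions).

Proposition 2 (Linear-fractional function) A linear-fractional function $F : \mathcal{D} \subset \mathbb{R}^d \to \mathbb{R}$
is the ratio of linear functions, $F(e) = \frac{\alpha + \langle \boldsymbol{\beta}, e \rangle}{\gamma + \langle \boldsymbol{\delta}, e \rangle}$. A non-constant linear-fractional function
is pseudo-linear on the open half-space $\mathcal{D} = \{ e \in \mathbb{R}^d \mid \gamma + \langle \boldsymbol{\delta}, e \rangle > 0,\ \alpha \neq 0 \}$.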
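The hyperplane characterization can be checked numerically for a linear-fractional function. Writing such a function as $F(e) = (\alpha + \langle \boldsymbol{\beta}, e \rangle) / (\gamma + \langle \boldsymbol{\delta}, e \rangle)$ (symbol names assumed for this sketch), multiplying through by the positive denominator gives $F(e) \geq t \Leftrightarrow \langle t\boldsymbol{\delta} - \boldsymbol{\beta}, e \rangle + t\gamma - \alpha \leq 0$, so one valid choice is $a(t) = t\boldsymbol{\delta} - \boldsymbol{\beta}$ and $b(t) = t\gamma - \alpha$. The Python sketch below tests this equivalence on random points; the helper names and the sampling box are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def linear_fractional(alpha: float, beta: Sequence[float], gamma: float,
                      delta: Sequence[float], e: Sequence[float]) -> float:
    """Evaluate (alpha + <beta, e>) / (gamma + <delta, e>)."""
    num = alpha + sum(b * x for b, x in zip(beta, e))
    den = gamma + sum(d * x for d, x in zip(delta, e))
    return num / den


def level_set_hyperplane(alpha: float, beta: Sequence[float], gamma: float,
                         delta: Sequence[float], t: float) -> Tuple[List[float], float]:
    """Return (a(t), b(t)) with a(t) = t*delta - beta and b(t) = t*gamma - alpha."""
    a = [t * d - b for b, d in zip(beta, delta)]
    return a, t * gamma - alpha


def count_mismatches(alpha, beta, gamma, delta, box, trials=1000, seed=0) -> int:
    """Count sampled (e, t) pairs where F(e) >= t disagrees with <a(t), e> + b(t) <= 0."""
    rng = random.Random(seed)
    mismatches = 0
    for _ in range(trials):
        e = [rng.uniform(lo, hi) for lo, hi in box]
        t = rng.uniform(0.0, 1.0)
        a, b0 = level_set_hyperplane(alpha, beta, gamma, delta, t)
        margin = sum(ai * x for ai, x in zip(a, e)) + b0
        if abs(margin) < 1e-12:  # skip numerical ties on the level set itself
            continue
        if (linear_fractional(alpha, beta, gamma, delta, e) >= t) != (margin <= 0):
            mismatches += 1
    return mismatches


# Binary F_1 with P_1 = 0.3 written as a linear-fractional function of (FN_1, FP_1):
# numerator 2 * (0.3 - e_1), denominator 2 * 0.3 - e_1 + e_2.
args = (0.6, [-2.0, 0.0], 0.6, [-1.0, 1.0])
box = [(0.0, 0.3), (0.0, 0.7)]  # feasible ranges of FN_1 and FP_1
print(count_mismatches(*args, box))
print(level_set_hyperplane(*args, 0.5)[0])
```

For this choice of coefficients, $a(t) = (2 - t, t)$, which has the same form as the cost vector $(1 + \beta^2 - t, t)$ with $\beta = 1$.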

3.2 Pseudo-Linearity of F-measures 

Several notions of F-measures used in practical problems are pseudo-linear. Here, we
establish that binary $F_\beta$ and multiclass/multilabel macro/micro F-measures are pseudo-linear
functions.

3.2.1 Binary Classification

In binary classification, we have $FN_2 = FP_1$ and we can write F-measures only by reference
to class 1. Then, for any $\beta > 0$ and any binary classifier $h$, the $F_\beta$-measure is

$$F_\beta(h) = \frac{(1 + \beta^2)(P_1 - FN_1(h))}{(1 + \beta^2) P_1 + FP_1(h) - FN_1(h)}$$

We can immediately notice that $F_\beta$ is linear-fractional and hence, by Proposition [2], it is
pseudo-linear in $FN_1$ and $FP_1$. Thus, with a slight (yet convenient) abuse of notation, we
write the $F_\beta$-measure for binary classification as a function of vectors in $\mathbb{R}^4 = \mathbb{R}^{2L}$:

$$\text{(binary)} \qquad \forall e \in \mathbb{R}^4, \quad F_\beta(e) = \frac{(1 + \beta^2)(P_1 - e_1)}{(1 + \beta^2) P_1 + e_2 - e_1}$$

where $e_i$ represents the $i$-th element of the error profile $e$. A surface plot of $F_1$ as a function
of $FN_1$ and $FP_1$ with level sets is given in Figure [1]. As Theorem [1] states, it can be easily
verified from the plot that level sets are hyperplanes.
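The error-profile form of binary $F_\beta$ agrees with the count-based definition of Section 2.1. The short derivation below makes the step explicit, assuming $\mu$ is the empirical distribution of a sample of size $n$, so that $tp = n(P_1 - FN_1)$, $fn = n\,FN_1$ and $fp = n\,FP_1$ (here $tp$, $fn$, $fp$ denote the aggregate counts over the sample).

```latex
\begin{align*}
(1+\beta^2)\,tp + \beta^2\,fn + fp
  &= (1+\beta^2)(tp + fn) - fn + fp \\
  &= n\left[(1+\beta^2)P_1 - FN_1 + FP_1\right], \\[4pt]
F_\beta
  = \frac{(1+\beta^2)\,tp}{(1+\beta^2)\,tp + \beta^2\,fn + fp}
  &= \frac{(1+\beta^2)\,n\,(P_1 - FN_1)}{n\left[(1+\beta^2)P_1 + FP_1 - FN_1\right]}
  = \frac{(1+\beta^2)(P_1 - FN_1)}{(1+\beta^2)P_1 + FP_1 - FN_1}.
\end{align*}
```

The sample size $n$ cancels, so the same expression holds whether counts or probabilities are used.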


3.2.2 Multilabel Classification 

In multilabel classification, there are several definitions of F-measures. For those based
on the error profiles, we first have the macro-F-measure (denoted by $MF_\beta$), which is the
average over class labels of the $F_\beta$-measure of each binary classification problem associated
with the prediction of the presence/absence of a given class:

$$\text{(multilabel-macro)} \qquad MF_\beta(e) = \frac{1}{L} \sum_{k=1}^{L} \frac{(1 + \beta^2)(P_k - e_{2k-1})}{(1 + \beta^2) P_k + e_{2k} - e_{2k-1}}$$

$MF_\beta$ is not a pseudo-linear function of an error profile $e$. However, if the multilabel
classification algorithm learns independent binary classifiers for each class (a method known
as one-vs-rest or binary relevance, see e.g. ?), then the $k$-th binary problem depends only
on $e_{2k-1}$ and $e_{2k}$. The maximization of the macro-F-measure with respect to all binary
classifiers is then a separable problem which boils down to independently maximizing the
$F_\beta$-measure for $L$ binary classification problems. In other words, optimizing $MF_\beta$ consists
in maximizing the pseudo-linear functions in $e_{2k-1}$ and $e_{2k}$ that correspond to each $F_\beta$
optimization. There are also micro-F-measures for multilabel classification. They correspond
to $F_\beta$-measures for a new binary classification problem over $\mathcal{X} \times \mathcal{L}$, in which one maps
a multilabel classifier $h : \mathcal{X} \to \mathcal{Y}$ ($\mathcal{Y}$ is here the power set of $\mathcal{L}$) to the following binary
classifier $\tilde{h} : \mathcal{X} \times \mathcal{L} \to \{0, 1\}$: we have $\tilde{h}(x, k) = 1$ if $k \in h(x)$, and 0 otherwise. The
micro-$F_\beta$-measure, written as a function of an error profile $e$ and denoted by $mF_\beta(e)$, is the
$F_\beta$-measure of $\tilde{h}$ and can be written as:

$$\text{(multilabel-micro)} \qquad mF_\beta(e) = \frac{(1 + \beta^2) \sum_{k=1}^{L} (P_k - e_{2k-1})}{(1 + \beta^2) \sum_{k=1}^{L} P_k + \sum_{k=1}^{L} (e_{2k} - e_{2k-1})}$$

This function is also linear-fractional, and thus pseudo-linear in $e$.
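The two multilabel formulas can be evaluated directly from the marginals and an error profile. The Python sketch below does so; the function names and the numerical values are hypothetical, and the code uses 0-based list indices, so `e[2 * k]` and `e[2 * k + 1]` hold the false negative and false positive probabilities of label $k + 1$.

```python
from typing import Sequence


def macro_f(p: Sequence[float], e: Sequence[float], beta: float = 1.0) -> float:
    """Macro F_beta: average of the per-label binary F_beta values."""
    b2 = beta * beta
    total = 0.0
    for k in range(len(p)):
        fn_k, fp_k = e[2 * k], e[2 * k + 1]
        total += (1 + b2) * (p[k] - fn_k) / ((1 + b2) * p[k] + fp_k - fn_k)
    return total / len(p)


def micro_f(p: Sequence[float], e: Sequence[float], beta: float = 1.0) -> float:
    """Micro F_beta: a single linear-fractional function of the whole error profile."""
    b2 = beta * beta
    num = (1 + b2) * sum(p[k] - e[2 * k] for k in range(len(p)))
    den = (1 + b2) * sum(p) + sum(e[2 * k + 1] - e[2 * k] for k in range(len(p)))
    return num / den


# Two labels with marginals 0.5 and 0.1 (hypothetical values).
p = [0.5, 0.1]
e = [0.1, 0.1, 0.05, 0.0]  # (FN_1, FP_1, FN_2, FP_2)
print(macro_f(p, e), micro_f(p, e))
```

With a single label the two quantities coincide with the binary $F_\beta$; with several labels they generally differ, since the macro version weights every label equally while the micro version pools the errors.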


3.2.3 Multiclass Classification 

The last example we take is from multiclass classification. It differs from multilabel
classification in that a single class must be predicted for each example. This restriction imposes
strong global constraints that make multiclass classification significantly harder. As for
the multilabel case, there are many definitions of F-measures for multiclass classification,
and in fact several definitions for the micro-F-measure itself. We will focus on the following
one, which is used in information extraction (e.g. in the BioNLP Challenge ?). Given $L$ class
labels, we will assume that label 1 corresponds to a "default" class, the prediction of which
is considered as not important. In information extraction, the default class corresponds to
the (majority) case where no information should be extracted. Then, a false negative is an
example $(x, y)$ such that $y \neq 1$ and $h(x) \neq y$, while a false positive is an example $(x, y)$ such
that $y = 1$ and $h(x) \neq y$. This micro-F-measure, denoted $mcF_\beta$, can be written as:

$$\text{(multiclass-micro)} \qquad mcF_\beta(e) = \frac{(1 + \beta^2)\left(1 - P_1 - \sum_{k=2}^{L} e_{2k-1}\right)}{(1 + \beta^2)(1 - P_1) - \sum_{k=2}^{L} e_{2k-1} + e_1}$$

Once again, this kind of micro-$F_\beta$-measure is linear-fractional and hence pseudo-linear in $e$.

Remark 3 (Non-pseudo-linear F-measures) In multilabel settings, the notion of
instance-wise $F_\beta$ has been used in the past (??????). It is similar to the micro-F-measure ($mF_\beta$)
for the multilabel case defined above, but defined over samples (instances) instead of labels.
It is defined as the average of the per-instance F-measure. Hence, we calculate the
F-measure for each instance independently (i.e. estimate $mF_\beta$ for each individual example
by calculating tp, fp, fn for each example in the sample) and take the average (arithmetic
mean) over the number of samples. This measure cannot be written as a linear-fractional
function of "error profile" terms, hence it cannot be optimized using our framework.


3.3 Optimizing F-Measure by Reduction to Cost-Sensitive Classification 

The $F_\beta$-measures presented above are non-linear aggregations of false negative/positive
proportions that cannot be written in the usual expected loss minimization framework;
usual learning algorithms are thus, intrinsically, not designed to optimize this kind of
performance measure. We show in Proposition [4] that the optimal classifier for a cost-sensitive
classification problem with label-dependent costs (??) is also an optimal classifier for the
pseudo-linear F-measures (within a specific, yet arbitrary, classifier set $\mathcal{H}$). In cost-sensitive
classification, each entry of the error profile is weighted asymmetrically by a non-negative
cost, and the goal is to minimize the weighted average error. Efficient, consistent algorithms
exist for such cost-sensitive problems (???). Even though the costs corresponding to the
optimal F-measure are not known a priori, we show in Proposition [5] that we can
approximate the optimal classifier with approximate costs. These costs, explicitly expressed in
terms of the optimal F-measure, motivate a practical algorithm. Even though the
discussion in this section is more general and applies to any pseudo-linear function, we start with
the discussion in binary settings. We give the proofs and results for binary $F_\beta$ and extend
the results to multilabel and multiclass F-measures in Section [3.4].

3.3.1 Reduction to Cost-Sensitive Classification 

Let $F : \mathcal{D} \subset \mathbb{R}^d \to \mathbb{R}$ be a fixed pseudo-linear function. We denote by $a : \mathbb{R} \to \mathbb{R}^d$ the
function mapping values of $F$ to the corresponding level set of Theorem [1]. We assume
that the distribution $\mu$ is fixed, as well as the (arbitrary) set of classifiers $\mathcal{H}$. We denote by
$\mathcal{E}(\mathcal{H})$ the closure of the image of $\mathcal{H}$ under $E$, i.e. $\mathcal{E}(\mathcal{H}) = \mathrm{cl}(\{E(h), h \in \mathcal{H}\})$ (the closure
ensures that $\mathcal{E}(\mathcal{H})$ is compact and that minima/maxima are well-defined), and we assume
$\mathcal{E}(\mathcal{H}) \subseteq \mathcal{D}$. Finally, for the sake of discussion with cost-sensitive classification, we assume
that $a(F(e)) \in \mathbb{R}^d_+$ for any $e \in \mathcal{E}(\mathcal{H})$, that is, lower values of errors entail higher values of $F$.

Proposition 4 Let $F^* = \max_{e \in \mathcal{E}(\mathcal{H})} F(e)$. We have:

$$e^* \in \operatorname*{argmin}_{e \in \mathcal{E}(\mathcal{H})} \langle a(F^*), e \rangle \;\Leftrightarrow\; F(e^*) = F^*.$$

This proposition shows that $a(F^*)$ is the cost vector (orthogonal to the level
set of $F$ at $F^*$, and not necessarily unique) that should be assigned to the error profile in
order to find the optimal classifier in $\mathcal{H}$ with respect to the measure $F$. Hence maximizing
$F$ amounts to minimizing $\langle a(F^*), E(h) \rangle$ with respect to $h$, that is, to solving
a cost-sensitive classification problem. This observation suggests that the optimization of
pseudo-linear measures could be a wrapper of cost-sensitive classification algorithms. The
costs $a(F^*)$ are, however, not known a priori. The following result shows that having only
approximate costs is sufficient to have an approximately optimal solution, which gives us
the main step towards a practical solution.


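Proposition 5 Let $\varepsilon_0 \geq 0$ and $\varepsilon_1 \geq 0$, and assume that there exists $\Phi > 0$ such that for
all $e, e' \in \mathcal{E}(\mathcal{H})$ satisfying $F(e') > F(e)$, we have:

$$F(e') - F(e) \leq \Phi \, \langle a(F(e')), e - e' \rangle.$$

Then, let us take $e^* \in \operatorname*{argmax}_{e' \in \mathcal{E}(\mathcal{H})} F(e')$, and denote $a^* = a(F(e^*))$. Let furthermore
$a \in \mathbb{R}^d_+$ and $h \in \mathcal{H}$, with error profile $e = E(h) \in \mathcal{E}(\mathcal{H})$, satisfy the following conditions:

(i) $\| a - a^* \|_2 \leq \varepsilon_0$,

(ii) $\langle a, e \rangle \leq \min_{e' \in \mathcal{E}(\mathcal{H})} \langle a, e' \rangle + \varepsilon_1$.

We have:

$$F(e) \geq F(e^*) - \Phi \cdot (2 \varepsilon_0 M + \varepsilon_1), \quad \text{where } M = \max_{e' \in \mathcal{E}(\mathcal{H})} \| e' \|_2.$$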


The above proposition suggests that pseudo-linear measures could be optimized by
wrapping cost-sensitive classification in an inner loop with an outer loop setting the appropriate
costs. This proposition also gives an upper bound on the gap between the F-measure of the
returned classifier and the optimal F-measure.
This bound depends on the size of the maximum error associated with the given hypothesis
space, $M$, measured in the $\ell_2$ sense, and on the constant $\Phi$. The value of $M$ depends on the selected
hypothesis class (through $\mathcal{E}(\mathcal{H})$). We call $\Phi$ the discretization factor, as it defines the granularity of
the approximation. It depends on the specific form of the F-measure and on the training sample. We
can find an approximately optimal classifier using a procedure where we search for an
approximately optimal cost and associated error profile by iterating through the preselected
cost interval in small steps. Thus, by searching for a cost such that $\varepsilon_0$ is close to zero, we
can find an approximately optimal F classifier. $\varepsilon_1$ can be regarded as the approximation
guarantee provided by the underlying cost-sensitive classification algorithm. Practical
implementations use a convex surrogate loss instead of the non-convex 0-1 loss. A discussion of
convex approximations of the 0-1 loss can be found in (?). The discretization factor $\Phi$ determines the
magnitude of the step size. A larger value of $\Phi$ calls for a more fine-grained discretization
(very small step size), and a smaller value of $\Phi$ allows a coarse-grained discretization.
Later, we will derive the exact values of $\Phi$ and the cost interval for specific F-measures.


3.3.2 Discretization Factor and Cost Interval for $F_\beta$

Here, we derive the values of the discretization factor ($\Phi$) and the range of the cost interval ($a$) for the binary $F_\beta$-measure.


Proposition 6 $F_\beta$ defined in Section 3.2.1 satisfies the conditions of Proposition 5 with:

(binary) $F_\beta$: $\Phi = \frac{1}{\beta^2 P_1}$ and $a : t \in [0,1] \mapsto (1 + \beta^2 - t,\ t,\ 0,\ 0)$.


This proposition gives the exact values of $\Phi$ and the range of $a$ in the binary setting. Here the discretization factor depends on the marginal probability of the positive class (assuming label 1 represents the positive class). A larger value of the discretization factor demands a smaller step size in the cost interval: looking at the approximation guarantee of Proposition 5, with a larger value of $\Phi$ a reasonable approximation can only be obtained by taking $\epsilon_0$ close to zero. Intuitively, higher values of $\Phi$ indicate highly imbalanced data with very few positive examples; hence, to eliminate the influence of class imbalance, we need to move through the cost interval in smaller steps. Given the error profile (in the form of a contingency table) and the associated costs as a matrix, as shown in Figure 2, the total misclassification cost $\langle a, e \rangle$ associated with the $F_\beta$-measure is the sum of the elements of the Hadamard product of the two matrices.
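As a small illustration of the Hadamard-product view, the following sketch computes the total misclassification cost from a 2x2 contingency table and the cost matrix of Figure 2. The function names and the numerical proportions are hypothetical and chosen only for illustration; the layout [[tp, fn], [fp, tn]] is an assumption about how the two matrices are aligned.

```python
def cost_matrix(t, beta=1.0):
    """Cost matrix aligned with the layout [[tp, fn], [fp, tn]]:
    false negatives cost 1 + beta^2 - t, false positives cost t."""
    return [[0.0, 1.0 + beta**2 - t],
            [t, 0.0]]


def total_cost(contingency, costs):
    """Sum of the entries of the element-wise (Hadamard) product."""
    return sum(c * a
               for row_c, row_a in zip(contingency, costs)
               for c, a in zip(row_c, row_a))


# Hypothetical proportions: tp = 0.30, fn = 0.10, fp = 0.05, tn = 0.55.
contingency = [[0.30, 0.10],
               [0.05, 0.55]]
cost = total_cost(contingency, cost_matrix(t=0.5))
```

Only the off-diagonal entries (the two error types) contribute, since correct predictions carry zero cost.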


Corollary 7 For the $F_1$-measure, the optimal classifier is the solution to the cost-sensitive binary classification problem with costs $(1 - \frac{F^*}{2},\ \frac{F^*}{2})$.

This corollary extends the result obtained by ? to the non-asymptotic regime. If we take $\mathcal{H}$ as the set of all measurable functions, the Bayes-optimal classifier for this cost is to predict class 1 when $\mu(y = 1 \mid x) \geq \frac{F^*}{2}$ (see ??).

Figure 2: Binary Classification. Panel (a) is the contingency table (columns: predicted label, P or N); panel (b) is the cost matrix.

3.3.3 Algorithm for $F_\beta$ Maximization

Based on the above results, we give a practical algorithm to find the optimal $F_\beta$. In the case of $F_\beta$, the cost function $a : [0,1] \to \mathbb{R}^d$, which assigns costs to probabilities of error, is Lipschitz-continuous with Lipschitz constant $\Phi = \max(1, \beta^2)$. Hence it is sufficient to discretize the interval $[0,1]$ into a set of evenly spaced values $\{t_1, \ldots, t_C\}$ (say, $t_{j+1} - t_j = \epsilon_0/\Phi$) to obtain an $\epsilon_0$-cover $\{a(t_1), \ldots, a(t_C)\}$ of the possible costs. Using the approximation guarantee of Proposition 5, learning a cost-sensitive classifier $h_j$ for each $a(t_j)$ and selecting the one with minimum total misclassification cost $\langle a(t_j), e(h_j) \rangle$ on a validation set is sufficient to obtain a $\Phi(2\epsilon_0 M + \epsilon_1)$-optimal solution. Our experimental results suggest that, in binary classification, choosing a classifier by this method is the same as selecting the classifier with optimal F-measure a posteriori. Hence our final algorithm consists of selecting the cost-sensitive classifier with optimal F-score. It is presented in Algorithm 1.
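The discretization step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names are hypothetical, the step size is left as a free parameter (the text suggests tying it to $\epsilon_0$), and the check at the end only verifies that consecutive cost vectors of the binary cost function $(1 + \beta^2 - t, t, 0, 0)$ stay within $\sqrt{2}$ times the step in $\ell_2$ distance.

```python
import math


def cost_grid(step, upper=1.0):
    """Evenly spaced values t_1 < ... < t_C covering [0, upper]."""
    n = math.ceil(upper / step - 1e-12)
    return [min(upper, j * step) for j in range(n + 1)]


def cost_vector(t, beta=1.0):
    """Binary cost vector a(t) = (1 + beta^2 - t, t, 0, 0)."""
    return (1.0 + beta**2 - t, t, 0.0, 0.0)


step = 0.1  # hypothetical step size
grid = cost_grid(step)
cover = [cost_vector(t) for t in grid]

# Largest l2 gap between consecutive cost vectors of the cover.
max_gap = max(math.dist(a, b) for a, b in zip(cover, cover[1:]))
```

A finer step shrinks `max_gap` proportionally, at the price of training one cost-sensitive classifier per grid point.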


Algorithm 1 Optimization of the $F_\beta$-measure

 1: procedure OPTIMIZE_Fβ(D, β)                         ▷ D = Data, β = β in F_β
 2:     bF = 0
 3:     Split Training Data into two D_tra, D_val
 4:     for t = (0 ... 1 + β²) do                        ▷ approximate cost
 5:         φ, θ, F = F_cs_learner(D_tra, D_val, t);     ▷ learn cost-sensitive model
 6:         if F > bF then
 7:             Φ = φ, Θ = θ, bF = F;
 8:         end if
 9:     end for
10:     return (Φ, Θ)
11: end procedure



Figure 2 panels (rows: actual label P, N; columns: predicted label P, N):

(a) Contingency Table

    True Positive (tp)     False Negative (fn)
    False Positive (fp)    True Negative (tn)

(b) Cost Matrix

    0                      1 + β² − t
    t                      0


Algorithm 2 Cost-Sensitive Learner for $F_\beta$

 1: procedure F_CS_LEARNER(D_tra, D_val, t)              ▷ D_tra = Training Data, D_val = Validation Data, t = cost
 2:     bF = 0
 3:     for ψ ∈ Ψ do                                     ▷ Ψ = set of tunable cost-sensitive algorithm hyper-parameters
 4:         φ = cost_sensitive_learner(D_tra, t, ψ);     ▷ generic cost-sensitive learner
 5:         θ, F = computeFβ(φ, D_val, β)                ▷ get optimal threshold and F_β
 6:         if F > bF then
 7:             Φ = φ, Θ = θ, bF = F;
 8:         end if
 9:     end for
10:     return (Φ, Θ, bF)
11: end procedure

The cost-sensitive classification algorithm used in the inner loop (step 5) returns the trained model. The predictscore method in the meta-algorithm simply returns the scores (a score can be a posterior probability, a geometric margin, etc.) on the validation set, and computeFβ returns the optimal F-measure and a score threshold (if any) on the validation data. Even though our theoretical results do not suggest thresholding the scores a posteriori, experimental results indicate the need for a posteriori thresholding of the scores. We will elaborate on this point in Section 5. This meta-algorithm can be instantiated with any cost-sensitive learning algorithm. The actual algorithm may simply consist of adjusting the hyper-parameters of a cost-insensitive classifier so as to optimize cost-sensitive classification, as in many practical implementations of cost-sensitive algorithms. This rudimentary approach results in considerable savings in computation time.
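The outer loop over costs can be sketched in a few lines. This is only an illustration under strong simplifying assumptions: the "cost-sensitive learner" is a hypothetical toy (`fit_cut`, a one-dimensional cut minimizing the weighted training error), the hyper-parameter loop and the a posteriori thresholding are omitted, and the data are invented.

```python
def f_beta(y_true, y_pred, beta=1.0):
    """F_beta computed from binary labels (1 = positive class)."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    denom = (1 + beta**2) * tp + beta**2 * fn + fp
    return (1 + beta**2) * tp / denom if denom > 0 else 0.0


def fit_cut(xs, ys, cost_fn, cost_fp):
    """Toy cost-sensitive learner (assumption): choose the cut c that
    minimizes the weighted training error of 'predict 1 when x >= c'."""
    def weighted_error(c):
        return sum(cost_fn if (y == 1 and x < c) else
                   cost_fp if (y == 0 and x >= c) else 0.0
                   for x, y in zip(xs, ys))
    return min(sorted(set(xs)) + [float("inf")], key=weighted_error)


def optimize_f_beta(train, val, beta=1.0, steps=20):
    """Outer loop over costs t in [0, 1 + beta^2]; keep the model with
    the best validation F_beta."""
    (train_x, train_y), (val_x, val_y) = train, val
    best_f, best_cut = -1.0, None
    for j in range(steps + 1):
        t = (1 + beta**2) * j / steps
        cut = fit_cut(train_x, train_y, 1 + beta**2 - t, t)
        preds = [1 if x >= cut else 0 for x in val_x]
        f = f_beta(val_y, preds, beta)
        if f > best_f:
            best_f, best_cut = f, cut
    return best_cut, best_f


# Invented one-dimensional data, for illustration only.
train = ([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
         [0, 0, 0, 1, 0, 0, 1, 1, 1])
val = ([0.15, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85],
       [0, 0, 1, 0, 0, 1, 1])
best_cut, best_f = optimize_f_beta(train, val)
```

Because the grid contains the symmetric cost t = 1 (for β = 1), the selected model is never worse on the validation set than the cost-insensitive one.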


3.4 Beyond Binary F-measure


As mentioned earlier, many notions of F-measures in multiclass and multilabel problems are pseudo-linear and can be optimized using our framework. Here, we derive the values of the cost vector ($a$) and the discretization factor ($\Phi$), and give F-measure optimization algorithms for the pseudo-linear F-measures described in Sections 3.2.2 and 3.2.3.


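3.4.1 Multilabel micro-F-measure


Proposition 8 The multilabel micro-F ($mF_\beta$) defined in Section 3.2.2 satisfies the conditions of Proposition 5 with:

(multilabel-micro) $mF_\beta$: $\Phi = \frac{1}{\beta^2 \sum_{k=1}^{L} P_k}$ and
$$a_i(t) = \begin{cases} 1 + \beta^2 - t & \text{if } i \text{ is odd} \\ t & \text{if } i \text{ is even.} \end{cases}$$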


Here the discretization factor depends on the sum of the marginal probabilities of the labels. A large value of $\Phi$ indicates that the majority of the labels are rare, and a smaller value of $\Phi$ indicates that few labels are rare. Since the misclassification of rare labels does not influence the micro-F-measure to a great extent (the F-score is independent of true negatives), we have to discretize in smaller steps only if the majority of the labels are rare. Given the above result on the cost vector $a$ and the discretization factor $\Phi$, and following the arguments given for $F_\beta$ (here also the cost function $a$ is Lipschitz-continuous, with Lipschitz constant $\max(1, \beta^2)$), we can develop an algorithm for finding the optimal classifier for $mF_\beta$. Unlike in the binary case, here we run the cost-sensitive learner with discretized cost values to find the classifier with the lowest total misclassification cost $\langle a(t_i), e(h_i) \rangle$. Our proposed algorithm is given in Algorithm 3. It is similar to the $F_\beta$ algorithm given in Algorithm 1, except that here we minimize the total misclassification cost instead of maximizing the empirical $F_\beta$ in the inner loop. It also needs the cardinality of the label space as an additional input parameter. The outer loop calculates the cost $a(t)$ for each value of $t$ as given in Proposition 8. The selected threshold is the one which minimizes the total misclassification cost $\langle a(t), e \rangle$ over all possible values of $a(t)$ and $e$.


Algorithm 3 Optimization of the $mF_\beta$-measure

 1: procedure OPTIMIZE_mFβ(D, L, β)                      ▷ D = Data, L = |𝓛|, β = β in F_β
 2:     bC = +∞
 3:     bmF = 0
 4:     Split Training Data into two D_tra, D_val
 5:     for t = (0 ... 1 + β²) do                        ▷ Approximate Cost
 6:         π = gen_mFβ_cost_vector(L, t, β)             ▷ Cost Vector
 7:         φ, θ = mF_cs_learner(D_tra, D_val, π)        ▷ learn cost-sensitive model
 8:         θ, mF = computemFβ(φ, D_val, θ, β)           ▷ get the optimal threshold and mF_β
 9:         if (mF > bmF) then
10:             bmF = mF, Φ = φ, Θ = θ;
11:         end if
12:     end for
13:     return (Φ, Θ)
14: end procedure



Algorithm 4 Cost-Sensitive Learner for $mF_\beta$

 1: procedure MF_CS_LEARNER(D_tra, D_val, π)             ▷ D_tra = Training Data, D_val = Validation Data, π = cost
 2:     bC = +∞
 3:     for ψ ∈ Ψ do                                     ▷ Ψ = set of tunable cost-sensitive algorithm hyper-parameters
 4:         φ = cost_sensitive_learner(D_tra, π, ψ);     ▷ generic cost-sensitive learner
 5:         θ, C = compute_cost(φ, D_val, π)             ▷ get optimal threshold and total misclassification cost
 6:         if (C < bC) then
 7:             Φ = φ, Θ = θ, bC = C;
 8:         end if
 9:     end for
10:     return (Φ, Θ)
11: end procedure

3.4.2 Multiclass micro-F-measure


Proposition 9 The multiclass micro-F ($mcF_\beta$) defined in Section 3.2.3 satisfies the conditions of Proposition 5 with:

(multiclass-micro) $mcF_\beta$: $\Phi = \frac{1}{\beta^2 (1 - P_1)}$ and
$$a_i(t) = \begin{cases} 1 + \beta^2 - t & \text{if } i \text{ is odd and } i \neq 1 \\ t & \text{if } i = 1 \\ 0 & \text{otherwise.} \end{cases}$$

Following the arguments given for the multilabel micro-F-measure, we can use Algorithm 3 to find the optimal $mcF_\beta$ with a small modification to the gen_mFβ_cost_vector method: the new cost generation method for the multiclass micro-F-measure follows the result of Proposition 9.
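The two cost-generation rules of Propositions 8 and 9 can be written down directly. The sketch below is illustrative: the function names are hypothetical, and it assumes (as the index conventions of the propositions suggest) that the error profile has 2L entries indexed from 1, with odd entries for false negatives and even entries for false positives.

```python
def multilabel_micro_costs(num_labels, t, beta=1.0):
    """a_i(t) for i = 1..2L: 1 + beta^2 - t if i is odd, t if i is even."""
    return [1 + beta**2 - t if i % 2 == 1 else t
            for i in range(1, 2 * num_labels + 1)]


def multiclass_micro_costs(num_labels, t, beta=1.0):
    """a_i(t) for i = 1..2L: t if i = 1, 1 + beta^2 - t if i is odd and
    i != 1, and 0 otherwise."""
    costs = []
    for i in range(1, 2 * num_labels + 1):
        if i == 1:
            costs.append(t)
        elif i % 2 == 1:
            costs.append(1 + beta**2 - t)
        else:
            costs.append(0.0)
    return costs


ml_costs = multilabel_micro_costs(num_labels=2, t=0.5)
mc_costs = multiclass_micro_costs(num_labels=3, t=0.5)
```

Swapping one generator for the other is the only change needed in the outer loop of Algorithm 3 when moving from the multilabel to the multiclass setting.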


Remark 10 (Beyond F-Measures) The Jaccard index is a set-based similarity measure. Given two sets, the Jaccard index is defined as the ratio of the size of their intersection to the size of their union. Like the $F_1$-measure, it ranges from 0 to 1, where 0 indicates disjoint sets and 1 indicates identical sets (7). It is used in cluster analysis and co-citation analysis, to name a few. Some recent work (27) examined the use of the Jaccard index as a performance measure in classification problems. The Jaccard index is a pseudo-linear function of the per-class false negatives and false positives. We can define Jaccard indexes for binary, multiclass and multilabel problems in terms of the error profile entries:

(binary) $\forall e \in \mathbb{R}^4$, $\quad Jac(e) = \frac{P_1 - e_1}{P_1 + e_2}$

(multilabel-micro) $\forall e \in \mathbb{R}^{2L}$, $\quad mJac(e) = \frac{\sum_{k=1}^{L} (P_k - e_{2k-1})}{\sum_{k=1}^{L} P_k + \sum_{k=1}^{L} e_{2k}}$

(multiclass-micro) $\forall e \in \mathbb{R}^{2L}$, $\quad mcJac(e) = \frac{1 - P_1 - \sum_{k=2}^{L} e_{2k-1}}{(1 - P_1) + e_1}$

As we can infer from the above equations, these quantities are pseudo-linear, and hence we can use the methodology developed in Section 3.3.1, thresholding cost-sensitive scores, to find the optimal Jaccard index classifier. Our analysis proves the remark of ?: “We also see that algorithms maximizing the F-measure perform the best for Jaccard index”.
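The link between the binary Jaccard index and the $F_1$-measure can be checked numerically. The sketch below assumes the usual form of the binary $F_1$ in terms of the error profile, $F_1(e) = 2(P_1 - e_1)/(2P_1 - e_1 + e_2)$ with $e_1$ the proportion of false negatives and $e_2$ the proportion of false positives; the function names and the sample profiles are hypothetical.

```python
def f1_from_profile(p1, e1, e2):
    """Binary F_1 from the positive-class marginal p1, false negatives e1
    and false positives e2 (all expressed as proportions)."""
    return 2 * (p1 - e1) / (2 * p1 - e1 + e2)


def jaccard_from_profile(p1, e1, e2):
    """Binary Jaccard index, Jac(e) = (P_1 - e_1) / (P_1 + e_2)."""
    return (p1 - e1) / (p1 + e2)


# Hypothetical error profiles (p1, e1, e2).
profiles = [(0.3, 0.05, 0.10), (0.3, 0.20, 0.01), (0.1, 0.02, 0.30)]
gaps = []
for p1, e1, e2 in profiles:
    f = f1_from_profile(p1, e1, e2)
    j = jaccard_from_profile(p1, e1, e2)
    # Jaccard is an increasing function of F_1: Jac = F_1 / (2 - F_1).
    gaps.append(abs(j - f / (2 - f)))
```

Since $F_1/(2 - F_1)$ is increasing in $F_1$, a classifier that maximizes the binary $F_1$ also maximizes the binary Jaccard index.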


4. Relationship to Multi-Objective Optimization 

Finding “good” classifiers amounts to finding good trade-offs between the different types of errors. In any case, it is a natural requirement that the chosen classifier has an error profile that is a minimal element of $\mathcal{E}(\mathcal{H})$ according to the partial order of Pareto dominance, which is denoted by $\preceq$ and is defined as:

$$\forall e, e' \in \mathbb{R}^d, \quad e \preceq e' \iff \forall k \in \{1, \ldots, d\},\ e_k \leq e'_k .$$

The set of optimal solutions defines the Pareto front.

Data distribution:

                    x_1     x_2     x_3
    μ(x)            0.65    0.30    0.05
    μ(y = 1 | x)    0.70    0.40    0.15

Classifiers and their F_1-measure:

    classifier    x_1    x_2    x_3    F_1 (%)
    h_A(x)        2      2      2       2.22
    h_B(x)        2      2      1       2.37
    h_C(x)        2      1      2      27.22
    h_D(x)        1      2      2      73.83
    h_E(x)        1      2      1      72.12
    h_F(x)        1      1      2      75.24
    h_G(x)        1      1      1      73.62

Figure 3: Pareto front for a binary classification problem ($\mathcal{L} = \{1, 2\}$, the positive class is 1), where the input space contains three points $x_1, x_2, x_3$. The table on the left describes the data distribution, defines the 8 possible classifiers and gives their $F_1$-measure.


Multi-objective optimization defines methods for finding the Pareto front, or approximations of it (?), and one of the motivations is to find (approximately) optimal solutions of a vector function that is hard to optimize. The process is to generate candidate points on the Pareto front, and take the candidate with the optimal value of the vector function. The advantage is that generating candidate points is faster than the direct optimization of the vector function. In our case, the goal is to find a classifier $h \in \mathcal{H}$ whose error profile $e(h) \in \mathcal{E}(\mathcal{H})$ achieves small values of $\langle a, e(h) \rangle$ for a predefined cost vector $a$.

The reduction from pseudo-linear functions to solving a series of cost-sensitive classification problems exactly corresponds to this Pareto front method. In fact, a general way of finding Pareto-optimal solutions of a multi-objective problem is the weighted-sum method (see e.g. ??). Applied to error profiles, the weighted-sum method would minimize positively weighted combinations of the elements of the error profiles, which corresponds to solving a cost-sensitive classification problem. In usual multi-objective optimization settings, such a Pareto set method is not useful for pseudo-linear aggregation functions, because most such functions are linear-fractional, and single-objective problems with a linear-fractional objective function can be rewritten in terms of a linear objective with linear constraints (see e.g. ?). In our context, however, the linearization would not help, because it would introduce constraints involving values of the error profiles, which are not linear in general. What we gain with the reduction to cost-sensitive classification (or, equivalently, with the weighted-sum method) is that efficient algorithms for cost-sensitive classification, which are known to work in practice and are asymptotically optimal, are already available. In addition, the weighted-sum method requires the user to know the relative preferences between the objectives in advance, which are not known in general. Hence the weight components are unbounded. Our reduction clearly defines a bound on the possible weights ($a(t)$).

The relationship between the reduction to cost-sensitive classification and the weighted-sum method allows us to discuss pseudo-linear F-measures in terms of Pareto-optimal solutions. It is well known that, in general, not all Pareto-optimal solutions can be found by the weighted-sum method; in fact, only those that are on the boundary of the convex hull of the feasible set can be reached. However, many classification problems have Pareto-optimal solutions that do not lie on this boundary, especially if the input space is finite (as is the case on any finite dataset). Figure 3 gives the example of the Pareto front of a binary classification problem with 3 examples. The Pareto front can be depicted on a 2D plane where the axes are false positives and false negatives; up to a change of basis, this Pareto front is the ROC curve (??) for the problem. In the figure, the blue points on the left plot correspond to Pareto-optimal classifiers (none of them can be improved both in terms of the proportion of false positives and of false negatives), while the red curve is the Pareto set of the convex hull of the error profiles of the 8 classifiers. Our reduction to cost-sensitive classification proves that only the classifiers whose error profile is both Pareto-optimal and on the boundary of the convex hull are candidates as optimal classifiers for any pseudo-linear aggregation function (here, the candidates are $c_a$, $c_d$, $c_f$), even though all classifiers are optimal for some trade-off rule. For instance, $c_b$ is the optimal classifier for the rule “minimize the proportion of false negatives under the constraint that the proportion of false positives is smaller than 0.1”.
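The gap between Pareto-optimal classifiers and those reachable by the weighted-sum method can be explored numerically on the three-point distribution of Figure 3. The sketch below is illustrative: it identifies classifiers by their tuple of predictions rather than by the labels used in the figure, and the grid of positive weights is an arbitrary choice.

```python
from itertools import product

# Distribution of Figure 3: mu(x) and mu(y = 1 | x) for the three points.
mu_x = [0.65, 0.30, 0.05]
mu_pos = [0.70, 0.40, 0.15]


def error_profile(preds):
    """(false negatives, false positives) as probabilities; label 1 is
    the positive class and label 2 the negative class."""
    fn = sum(m * p for m, p, y in zip(mu_x, mu_pos, preds) if y == 2)
    fp = sum(m * (1 - p) for m, p, y in zip(mu_x, mu_pos, preds) if y == 1)
    return fn, fp


profiles = {h: error_profile(h) for h in product((1, 2), repeat=3)}


def dominated(e, others):
    """True if another profile is at least as good on both error types."""
    return any(o != e and o[0] <= e[0] and o[1] <= e[1] for o in others)


pareto = {h for h, e in profiles.items()
          if not dominated(e, profiles.values())}

# Weighted-sum method: sweep strictly positive weights (w_fn, w_fp).
reachable = set()
for j in range(1, 200):
    w_fn = j / 200
    w_fp = 1 - w_fn
    best = min(profiles,
               key=lambda h: w_fn * profiles[h][0] + w_fp * profiles[h][1])
    reachable.add(best)
```

On this example every classifier found by the sweep is Pareto-optimal, but several Pareto-optimal classifiers, such as the one predicting the positive class only on $x_3$, are never selected for any weight on the grid.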

5. Experiments 

This section illustrates the accuracy of the algorithms suggested by our theoretical framework, using the $F_1$-measure, in binary and multilabel classification. Our experimental results for the binary and multilabel-macro F-measure (using binary relevance) show that (i) choosing the optimal F classifier by minimizing $\langle a, e \rangle$ is the same as choosing the classifier with optimal F-measure a posteriori; (ii) to maximize the F-measure, selecting a classifier by thresholding cost-sensitive scores is preferable to algorithms based on thresholding cost-insensitive classification scores; (iii) in the case of the multilabel-micro F-measure, the optimal F classifier is the one with the lowest $\langle a, e \rangle$ value.

We compare thresholded cost-sensitive classification, as implemented by SVMs and logistic regression (LR) with asymmetric costs, to thresholded linear classifiers (SVMs and logistic regression, with a decision threshold set a posteriori by maximizing the $F_1$-score on the validation set). In addition, the structured SVM approach to $F_1$-measure maximization of ?, SVM^perf, provides another baseline. For completeness, we also report results for non-thresholded cost-sensitive SVMs, non-thresholded cost-sensitive logistic regression, and for the thresholded versions of SVM^perf.

Since practical cost-sensitive algorithms are based on convex surrogate loss optimization (?), the guarantee for approximate costs presented in Proposition 5 will not hold in general. We call the cost given in Proposition 5 the actual cost, and the cost used in the practical surrogate-loss-based algorithm the surrogate cost. Since there is no one-to-one mapping between the actual cost and the surrogate cost, in practical implementations we have to iterate over the surrogate costs for each value of the actual cost.

SVM and LR differ in the loss they optimize (weighted hinge loss for SVMs, weighted log-loss for LR), and even though both losses are calibrated in the cost-sensitive setting (that is, converging toward a Bayes-optimal classifier as the number of examples and the capacity of the class of functions grow to infinity) (?), they behave differently on finite datasets or with restricted classes of functions. We may also note that, asymptotically, the Bayes classifier for a cost-sensitive binary classification problem is a classifier which thresholds the posterior probability of being in class 1. Thus, all methods but SVM^perf are asymptotically equivalent, and our goal here is to analyze their non-asymptotic behavior on a restricted class of functions.

For each experiment, the training set was split at random, keeping 1/3 for the validation set used to select all hyper-parameters, based on the maximization of the $F_1$-measure on this set. For datasets that do not come with a separate test set, the data was first split to keep 1/4 for testing. All results are averaged over five random splits (hold-out validation). The algorithms have from one to four hyper-parameters: (i) all algorithms are run with $L_2$ regularization, with a regularization parameter $C \in \{2^{-6}, 2^{-5}, \ldots, 2^{6}\}$; (ii) for the cost-sensitive algorithms, the cost for false negatives is chosen in $\{\frac{2-t}{t},\ t \in \{0.1, 0.2, \ldots, 1.9\}\}$ of Proposition 4¹; (iii) for the thresholded algorithms, the threshold is chosen among all the scores of the validation examples; (iv) for the kernel-based SVM, we used the radial basis function (RBF) kernel with parameter $\gamma$ (a measure of the influence of a single training example) taking values $\gamma \in \{2^{-6}, 2^{-5}, \ldots, 2^{6}\}$.
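For concreteness, the hyper-parameter grids described above can be generated as follows. This is a sketch under an explicit assumption: the false-negative cost is taken to be $(2 - t)/t$, that is, the ratio of the two non-zero entries of the binary cost vector for $\beta = 1$ with the false-positive cost normalized to 1. The variable names are hypothetical.

```python
# Regularization parameter C (and RBF parameter gamma): 2^-6, ..., 2^6.
C_grid = [2.0**k for k in range(-6, 7)]
gamma_grid = list(C_grid)

# Cost parameter t in {0.1, 0.2, ..., 1.9}.
t_grid = [round(0.1 * k, 1) for k in range(1, 20)]

# Assumed false-negative cost (2 - t) / t, false-positive cost fixed to 1.
fn_costs = [(2 - t) / t for t in t_grid]

# Number of (C, cost) configurations for a linear cost-sensitive learner.
num_linear_configs = len(C_grid) * len(fn_costs)
```

Under this parametrization, t = 1 gives the symmetric (cost-insensitive) setting, small t penalizes false negatives heavily, and t close to 1.9 penalizes false positives.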

The library LIBLINEAR (?) was used to implement the non-kernel SVM² and logistic regression. The LIBSVM (?) library was used for the kernel SVM. A constant feature with value 100 (to simulate an unregularized offset) was added to each dataset.

5.1 Importance of Thresholding 

Although our theoretical developments do not indicate any need to threshold the scores of classifiers, the practical benefits of a post-hoc adjustment of these scores can be important in terms of $F_1$-measure maximization, as already noted in cost-sensitive learning scenarios (??). We study the importance of thresholding classification scores a posteriori using a didactic dataset called “Galaxy”, visualized in Figure 4. The data distribution consists of four clusters of 2D examples, indexed by $z \in \{1, 2, 3, 4\}$, with prior probabilities $\mu(z = 1) = 0.01$, $\mu(z = 2) = 0.1$, $\mu(z = 3) = 0.001$, and $\mu(z = 4) = 0.889$, and respective class prior probabilities $\mu(y = 1 \mid z = 1) = 0.9$, $\mu(y = 1 \mid z = 2) = 0.09$, $\mu(y = 1 \mid z = 3) = 0.9$, and $\mu(y = 1 \mid z = 4) = 0$. “Galaxy” is an example of a highly imbalanced dataset.

We drew a very large sample (100,000 examples) from the distribution, whose optimal $F_1$-measure is 67.5%. Without thresholding the scores of the classifiers, the best $F_1$-measure among the classifiers is 58.0%, obtained by the cost-sensitive SVM, whereas tuning thresholds makes it possible to reach the optimal $F_1$-measure for SVM^perf and the cost-sensitive SVM. On the other hand, LR is severely affected by the non-linearity of the level sets of the posterior probability distribution, and does not reach this limit (best $F_1$-measure of 56.5%). Note also that, even with this very large sample size, the SVM and LR classifiers are very different. This result suggests that thresholding the classification scores a posteriori may improve the F-scores obtained, especially when thresholding the scores of cost-sensitive classifiers.


1. We take t greater than 1 in case the training asymmetry would be different from the true asymmetry (?).

2. The maximum number of iterations for SVMs was set to 50,000 instead of the default 1,000.

[Figure 4 panels: “before thresholding” and “after thresholding”; horizontal axis $x_1$.]

Figure 4: Decision boundaries for the Galaxy dataset before and after thresholding the classifier scores of SVM^perf (dotted, blue), weighted SVM (dot-dashed, cyan), unweighted logistic regression (solid, red), and weighted logistic regression (dashed, green). The horizontal black dotted line is an optimal decision boundary.


    Name      Type         Labels    Train     Test      Features    Label Freq. (%) (min/max)
    Adult     binary       2         32,561    16,281    123         -
    Galaxy    binary       2         18,000    7,000     2           -
    RCV1      multilabel   101       23,149    10,000    47,236      0.008/46.6
    Scene     multilabel   6         1,211     1,196     294         13.6/22.8
    Siam      multilabel   22        21,519    7,077     30,438      1.4/59.8
    Yeast     multilabel   14        1,500     917       103         25.2/43.0

Table 1: Dataset Attributes


5.2 Binary $F_\beta$ and Multilabel $MF_\beta$

The other datasets we use are Adult, RCV1, Scene, Siam and Yeast. In addition, we used a subsample from the Galaxy data to demonstrate the empirical validity of the algorithm. Adult, RCV1 and Yeast are obtained from the UCI repository³, and Scene and Siam from the Libsvm repository⁴. The attributes of the data used in our empirical study are given in Table 1.

The results for binary $F_\beta$ and multilabel macro-F ($MF_\beta$) are reported in Tables 2 and 3, respectively. As is evident from the experimental results, cost-sensitive learning and thresholded cost-sensitive learning give the best results, whereas the other methods perform suboptimally. But the difference between methods is less extreme than on the artificial Galaxy dataset. The Adult dataset is an example where all methods perform nearly identically; the surrogate loss used in practice seems unimportant. On the other datasets, we observe that thresholding has a relatively large impact, especially for SVM^perf and cost-insensitive classifiers. The unthresholded, cost-insensitive SVM and LR results are very poor compared to the thresholded and cost-sensitive versions. The cost-sensitive classifiers (thresholded and unthresholded) outperform all other methods, as suggested by the theory. The cost-sensitive SVM is probably the method of choice to optimize binary $F_\beta$ or multilabel macro-F ($MF_\beta$) when predictive performance is a must. On these datasets, thresholded LR still performs reasonably well considering its relatively low computational cost. In general, on the computational cost front, LR converges faster than SVM or SVM^perf.

3. https://archive.ics.uci.edu/ml/datasets.html

4. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html

    Baseline    SVM^perf         SVM                              LR
    Options     -       T        -       T       CS      CS&T     -       T       CS      CS&T
    Adult       67.3    67.3     66.9    67.5    67.9    67.8     65.0    67.7    67.7    67.9
    Galaxy      48.4    61.7     43.1    61.4    58.0    62.0     35.4    51.9    41.8    56.5

Table 2: $F_1$-measures (in %) for baseline algorithms with their usual settings (-) and different options: T for thresholded classification scores, CS for cost-sensitive training, CS&T for cost-sensitive training and thresholded classification scores

    Baseline    SVM^perf         SVM                              LR
    Options     -       T        -       T       CS      CS&T     -       T       CS      CS&T
    RCV1        44.0    52.8     46.6    54.2    50.9    54.5     40.9    52.9    48.5    53.3
    Scene       68.3    69.6     66.2    69.6    69.6    69.6     67.0    69.9    69.8    70.1
    Siam        48.2    52.8     48.1    52.4    52.7    53.4     44.7    51.9    51.7    52.2
    Yeast       46.4    46.4     39.1    46.2    47.2    46.3     38.8    47.4    47.4    47.2

Table 3: Macro-$F_1$-measures $MF_1$ (in %) for baseline algorithms with their usual settings (-) and different options: T for thresholded classification scores, CS for cost-sensitive training, CS&T for cost-sensitive training and thresholded classification scores

Table 4 presents the optimal $MF_\beta$-measure with kernel SVM. We used the radial basis function (RBF) as the kernel function and trained the RBF SVM without a bias term. Our experiments illustrate our theoretical findings in kernel settings. In the case of Scene, thresholding the cost-sensitive scores marginally improves the $MF_1$-score, whereas in the case of the Yeast data, the cost-sensitive kernel SVM outperforms the other methods. In both cases, thresholding the cost-insensitive scores deteriorates the $MF_1$-scores.

    Options    -       T       CS      CS&T
    Scene      68.9    68.3    70.5    70.9
    Yeast      48.6    48.5    48.8    47.9

Table 4: Macro-$F_1$-measures $MF_1$ (in %) for SVM with RBF kernel with the usual settings (-) and different options: T for thresholded classification scores, CS for cost-sensitive training, CS&T for cost-sensitive training and thresholded classification scores

5.3 Multilabel $mF_\beta$

In the case of the multilabel-micro-F-measure, we compare our algorithm with a commonly used method to find the best $mF_\beta$-score, suggested by ?. In that method, one assumes that an optimal classifier for the macro-F-measure is an optimal classifier for the micro-F-measure.


    Baseline             SVM^perf         SVM                              LR
    Options              -       T        -       T       CS      CS&T     -       T       CS      CS&T
    RCV1     C_min       48.2    49.6     47.6    49.7    49.9    50.2     46.3    49.8    49.9    49.9
             F_max       42.8    44.7     47.6    44.1    49.2    44.2     46.4    44.3    49.3    44.5
    Scene    C_min       66.7    68.5     65.4    68.7    68.8    68.6     66.6    69.2    68.6    69.4
             F_max       66.6    68.3     65.2    68.3    68.3    68.3     66.4    69.2    68.6    68.8
    Siam     C_min       59.2    62.5     60.3    62.2    62.6    62.5     60.2    62.4    62.0    62.3
             F_max       59.2    62.0     60.1    62.0    62.3    62.2     59.0    61.8    61.9    62.0
    Yeast    C_min       61.8    65.1     64.1    64.8    65.6    65.2     63.3    64.9    65.3    64.9
             F_max       60.2    60.2     60.6    59.3    60.7    61.2     63.2    59.8    61.0    60.9

Table 5: Micro-$F_1$-measures $mF_1$ (in %) for baseline algorithms with their usual settings (-) and different options: T for thresholded classification scores, CS for cost-sensitive training, CS&T for cost-sensitive training and thresholded classification scores. Two optimization strategies are compared: $C_{min}$ for $mF_1$ by the proposed algorithm and $F_{max}$ for $mF_1$ corresponding to the optimal $MF_1$


Hence, the micro-F-score corresponding to the optimal macro-F-score is deemed the optimal micro-F-score. We compare our algorithm for the micro-F-score against the micro-F-score corresponding to the optimal macro-F-score obtained by running binary relevance, as explained in Section 3.2.2.

Table 5 contains the multilabel-micro-F ($mF_\beta$) results for the multilabel datasets. The results show that selecting the micro-F corresponding to the maximal macro-F ($F_{max}$ in the table) returns suboptimal results in nearly all settings. So in practice, algorithms based on per-label macro-F optimization should be avoided for micro-F optimization. In the case of micro-F, the effect of thresholding is not very significant, except for the RCV1 data. The unthresholded classifiers perform nearly as well as the thresholded versions. This is also true for SVM^perf. As suggested by the theory, cost-sensitive classification is the preferred method to optimize multilabel-micro-F. Here also, thresholded LR can be considered as an alternative, considering the computational cost.
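Why a classifier selected for macro-F can be a poor choice for micro-F is easy to see on a toy example with one frequent and one rare label. The per-label counts below are invented for illustration, and the helper names are hypothetical; only the standard definitions of macro- and micro-averaged $F_1$ are used.

```python
def f1(tp, fp, fn):
    """F_1 from counts of true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(counts):
    """Average of the per-label F_1 scores."""
    return sum(f1(*c) for c in counts) / len(counts)


def micro_f1(counts):
    """F_1 of the counts pooled over all labels."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f1(tp, fp, fn)


# Invented (tp, fp, fn) counts per label: [frequent label, rare label].
classifier_a = [(90, 10, 10), (1, 4, 4)]   # good on the frequent label
classifier_b = [(70, 30, 30), (4, 1, 1)]   # good on the rare label

macro_a, macro_b = macro_f1(classifier_a), macro_f1(classifier_b)
micro_a, micro_b = micro_f1(classifier_a), micro_f1(classifier_b)
```

Here the macro average prefers the second classifier while the micro average prefers the first, because pooled counts are dominated by the frequent label.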

Table 6 presents the optimal $mF_\beta$-measure with the RBF kernel SVM. Similar to the $MF_\beta$ results, thresholding the cost-sensitive scores gives better $mF_\beta$ results for the kernel SVM.

    Options             -       T       CS      CS&T
    Scene    C_min      67.2    67.1    67.5    67.1
             F_max      67.0    67.0    67.2    67.4
    Yeast    C_min      65.9    66.3    66.3    66.6
             F_max      59.4    62.9    59.9    63.5

Table 6: $mF_1$ for SVM with RBF kernel with the usual settings (-) and different options: T for thresholded classification scores, CS for cost-sensitive training, CS&T for cost-sensitive training and thresholded classification scores. $C_{min}$ for $mF_1$ by the proposed algorithm and $F_{max}$ for $mF_1$ corresponding to the optimal $MF_1$



Figure 5: Plot of micro-F-measure against false negative cost 


5.4 Cost Space Search Overhead 

Since the actual cost associated with misclassification differs from the cost associated with the surrogate loss, it introduces an extra loop in our algorithm. Hence searching for the optimal cost vector over the whole discretized cost interval might not be a good idea, especially when the value of $\Phi$ is large. Here we carry out an empirical analysis of the functional dependency between the actual cost and the corresponding F-measure, and devise an improved version of the algorithms discussed in Section 3.4.

Figure 5 contains the plot of the micro-F-measure against the false negative cost. From the plot, it appears that the micro-F-measure is a quasi-concave function of the false negative cost. A function is quasi-concave if every superlevel set of the function is convex (?). Formally, a function $g : \mathcal{D} \subseteq \mathbb{R}^n \to \mathbb{R}$ is quasi-concave if $\{x \in \mathcal{D} \mid g(x) \geq \alpha\}$ is convex for every $\alpha$. It can be verified from the plot that the superlevel sets are convex. Bracketing methods (?) are extensively used to find the global maximum of unimodal functions such as quasi-concave functions. We will not be able to use the exact bracketing algorithm to find the optimal cost, since it requires knowledge of the error profile associated with each value of the F-measure. But we can use the idea of bracketing to limit the discretization interval.

Here, we find three points $(p, q, r)$ such that $g(p) < g(q) > g(r)$; then, instead of
discretizing the whole interval, we can limit the discretization to the sub-interval
$(p, r)$. We start with two intervals defined by three points: the start of the interval ($0$),
the midpoint of the interval ($\frac{1+\beta^2}{2}$) and the end of the interval ($1+\beta^2$). Then we
search for triplets $(p, q, r)$ of a given minimum sub-interval size inside the two intervals. In the
simplest case, we compute the F-measure values at five points: the two end points, the midpoint
($\frac{1+\beta^2}{2}$) and the midpoints of the two intervals $(0, \frac{1+\beta^2}{2})$ and
$(\frac{1+\beta^2}{2}, 1+\beta^2)$. Since the function
is quasi-concave, the global maximum is either at the midpoint or to the left or right of
the midpoint. Depending on the F-measure values at the five points, we can limit the
discretization to one half of the interval. This way, we reduce the discretization space by at
least half.
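
The five-point narrowing step described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: `f_measure_at` is a hypothetical callable that trains a cost-sensitive classifier for a given false negative cost and returns the resulting F-measure, and the sketch assumes the curve has no plateaus, so that ties between probes do not occur.

```python
def narrow_cost_interval(f_measure_at, beta=1.0):
    """Return a sub-interval (p, r) of [0, 1 + beta**2] bracketing the best cost.

    f_measure_at is a hypothetical callable (an assumption of this sketch):
    it maps a false negative cost to the F-measure of the classifier
    trained with that cost.
    """
    hi = 1.0 + beta ** 2
    # Five probes: both end points, the midpoint, and the two quarter points.
    xs = [0.0, hi / 4, hi / 2, 3 * hi / 4, hi]
    ys = [f_measure_at(x) for x in xs]
    best = max(range(5), key=lambda i: ys[i])
    # For a strictly unimodal curve the maximiser lies between the
    # neighbours of the best probe, which span at most half the interval.
    lo_idx = max(best - 1, 0)
    hi_idx = min(best + 1, 4)
    return xs[lo_idx], xs[hi_idx]
```

The returned sub-interval can then be discretized in place of the full interval, or the function can be applied again on it if a finer bracket is wanted.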


6. Conclusion 

We presented an analysis of F-measures, leveraging the property of pseudo-linearity of 
specific notions of F-measures to obtain a strong non-asymptotic reduction to cost-sensitive 
classification. The results hold on any dataset, for any class of functions and under any
assumption on the data distribution (label dependent or label independent). We suggested algorithms
for F-measure optimization based on minimizing the total misclassification cost of
cost-sensitive classification. We presented experiments on linear classifiers, illustrating the
theoretical interest of using cost-sensitive classification algorithms rather than probability
thresholding. We also showed that, for F-measure maximization, thresholding even the
cost-sensitive algorithms helps to achieve good performance.

Empirically and algorithmically, we only explored the simplest cases of our result ($F_\beta$-measure
in binary classification, and macro-$F_\beta$-measure and micro-$F_\beta$-measure in multilabel
classification), but much more remains to be done. Algorithms for the optimization of
non-pseudo-linear notions of F-measures, such as the instance-wise $F_\beta$-measure in multilabel
classification, have received interest recently as well (??), but are for now limited. We also
believe that our result can lead to progress towards optimizing the micro-$F_\beta$-measure in
multiclass classification.

Appendix A. Proofs of Propositions and Corollaries

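Proposition 2 A linear-fractional function $F : \mathcal{D} \subset \mathbb{R}^d \to \mathbb{R}$ is the ratio of
linear functions, $F(e) = \frac{\alpha_0 + \langle \gamma, e \rangle}{\alpha_1 + \langle \delta, e \rangle}$.
A non-constant linear-fractional function is pseudo-linear on the open
half-space $\mathcal{D} = \{e \in \mathbb{R}^d \mid \alpha_1 + \langle \delta, e \rangle > 0\}$.

Proof We show that a linear-fractional function
$F : e \in \mathbb{R}^d \mapsto \frac{\alpha_0 + \langle \gamma, e \rangle}{\alpha_1 + \langle \delta, e \rangle}$
with $\alpha_1 + \langle \delta, e \rangle > 0$ is pseudo-linear. For any $t$, we have

$$F(e) \le t \;\Leftrightarrow\; \alpha_0 + \langle \gamma, e \rangle \le t\,(\alpha_1 + \langle \delta, e \rangle)
\;\Leftrightarrow\; (\alpha_0 - t\alpha_1) + \langle \gamma - t\delta, e \rangle \le 0 .$$

Now reversing the inequality, we obtain

$$F(e) \ge t \;\Leftrightarrow\; (\alpha_0 - t\alpha_1) + \langle \gamma - t\delta, e \rangle \ge 0 .$$

The above inequalities define half-spaces whose boundaries are hyperplanes. Moreover,

$$\nabla F(e) = \frac{(\alpha_1 + \langle \delta, e \rangle)\,\gamma - (\alpha_0 + \langle \gamma, e \rangle)\,\delta}{(\alpha_1 + \langle \delta, e \rangle)^2} .$$

The gradient term is constant if $\delta$ and $\gamma$ are proportional, and non-zero otherwise. The
above conditions confirm the requirements for pseudo-linearity given in Theorem 1, and
hence the result. □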


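Proposition 4 Let $F^* = \max_{e \in \mathcal{E}(\mathcal{H})} F(e)$. We have:
$e^* \in \operatorname{argmin}_{e \in \mathcal{E}(\mathcal{H})} \langle a(F^*), e \rangle \;\Leftrightarrow\; F(e^*) = F^*$.

Proof Let $e^* \in \operatorname{argmax}_{e' \in \mathcal{E}(\mathcal{H})} F(e')$, and let $a^* = a(F(e^*)) = a(F^*)$. We first notice that
pseudo-linearity implies that the set of $e \in \mathcal{D}$ such that $\langle a^*, e \rangle = \langle a^*, e^* \rangle$ corresponds to the
level set $\{e \in \mathcal{D} \mid F(e) = F(e^*) = F^*\}$. Thus, we only need to show that $e^*$ is a minimizer
of $e' \mapsto \langle a^*, e' \rangle$ in $\mathcal{E}(\mathcal{H})$. To see this, we notice that pseudo-linearity of $F$ (see Theorem 1)
implies

$$\forall e' \in \mathcal{D}, \quad F(e^*) \ge F(e') \;\Rightarrow\; \langle a^*, e^* \rangle \le \langle a^*, e' \rangle ,$$

and since $e^*$ maximizes $F$ in $\mathcal{E}(\mathcal{H})$, we get $e^* \in \operatorname{argmin}_{e' \in \mathcal{E}(\mathcal{H})} \langle a^*, e' \rangle$. □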
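
As a small numerical illustration of this proposition, the sketch below takes a hand-picked finite set of error profiles (false negative rate, false positive rate) and checks that the profile maximizing the $F_1$-measure is also the one minimizing the cost $\langle a(F^*), e \rangle$. The profiles and the value of `P1` are made-up numbers, and the cost vector $(1 + \beta^2 - t, t)$ is borrowed from the binary case of Proposition 6.

```python
def f_beta(e, p1, beta=1.0):
    """F_beta as a function of the error profile e = (FN rate, FP rate)."""
    fn, fp = e
    b2 = beta ** 2
    return (1 + b2) * (p1 - fn) / ((1 + b2) * p1 - fn + fp)


def cost_vector(t, beta=1.0):
    """Costs (FN cost, FP cost) associated with the F_beta level t."""
    return (1 + beta ** 2 - t, t)


def weighted_cost(a, e):
    """Inner product <a, e> for two-dimensional cost and error vectors."""
    return a[0] * e[0] + a[1] * e[1]


# Illustrative values only: P1 is the positive class proportion and
# profiles stands in for the achievable set E(H).
P1 = 0.3
profiles = [(0.00, 0.70), (0.05, 0.30), (0.10, 0.10), (0.20, 0.02), (0.30, 0.00)]

best_by_f = max(profiles, key=lambda e: f_beta(e, P1))
f_star = f_beta(best_by_f, P1)
best_by_cost = min(profiles, key=lambda e: weighted_cost(cost_vector(f_star), e))
```

On this toy set both selections return the same profile, as the proposition predicts.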

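Proposition 5 Let $\varepsilon_0 \ge 0$ and $\varepsilon_1 \ge 0$, and assume that there exists $\Phi > 0$ such that for all
$e, e' \in \mathcal{E}(\mathcal{H})$ satisfying $F(e') > F(e)$, we have:

$$F(e') - F(e) \le \Phi \, \langle a(F(e')), e - e' \rangle . \qquad (1)$$

Then, let us take $e^* \in \operatorname{argmax}_{e' \in \mathcal{E}(\mathcal{H})} F(e')$, and denote $a^* = a(F(e^*))$. Let furthermore
$a \in \mathbb{R}^d_+$ and $h \in \mathcal{H}$, with error profile $e \in \mathcal{E}(\mathcal{H})$, satisfy the following conditions:

(i) $\|a - a^*\|_2 \le \varepsilon_0$

(ii) $\langle a, e \rangle \le \min_{e' \in \mathcal{E}(\mathcal{H})} \langle a, e' \rangle + \varepsilon_1$

We have: $F(e) \ge F(e^*) - \Phi \cdot (2\varepsilon_0 M + \varepsilon_1)$, where $M = \max_{e' \in \mathcal{E}(\mathcal{H})} \|e'\|_2$.

Proof Let $e' \in \mathcal{E}(\mathcal{H})$. We can write $\langle a, e' \rangle = \langle a^*, e' \rangle + \langle a - a^*, e' \rangle$. Applying the
Cauchy-Schwarz inequality and condition (i), we get

$$\langle a, e' \rangle \le \langle a^*, e' \rangle + \|a - a^*\|_2 \, \|e'\|_2 \le \langle a^*, e' \rangle + \varepsilon_0 M .$$

In particular, we have:

$$\min_{e' \in \mathcal{E}(\mathcal{H})} \langle a, e' \rangle \le \min_{e' \in \mathcal{E}(\mathcal{H})} \langle a^*, e' \rangle + \varepsilon_0 M
\le \langle a^*, e^* \rangle + \varepsilon_0 M , \qquad (2)$$

since $e^* \in \operatorname{argmin}_{e' \in \mathcal{E}(\mathcal{H})} \langle a^*, e' \rangle$ as shown in Proposition 4.

Similarly, we have $\langle a^*, e \rangle = \langle a, e \rangle + \langle a^* - a, e \rangle$; applying Cauchy-Schwarz and conditions
(i) and (ii), we have, for the error profile $e$ of condition (ii):

$$\langle a^*, e \rangle \le \langle a, e \rangle + \|a^* - a\|_2 \, \|e\|_2 \le \langle a, e \rangle + \varepsilon_0 M
\le \min_{e' \in \mathcal{E}(\mathcal{H})} \langle a, e' \rangle + \varepsilon_1 + \varepsilon_0 M . \qquad (3)$$

Combining Inequalities (2) and (3), we get

$$\langle a^*, e \rangle \le \langle a^*, e^* \rangle + \varepsilon_1 + 2\varepsilon_0 M ,
\quad \text{that is,} \quad \langle a^*, e - e^* \rangle \le \varepsilon_1 + 2\varepsilon_0 M ,$$

and the final result follows from Assumption (1) applied with $e' = e^*$. □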
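
To make the guarantee concrete, the following sketch evaluates the bound of this proposition on a toy example for the $F_1$-measure. All numbers are illustrative assumptions: `profiles` stands in for $\mathcal{E}(\mathcal{H})$, `a_used` is an approximate cost vector, and $\Phi = 1/(\beta^2 P_1)$ with $\beta = 1$ is taken from the binary case of Proposition 6.

```python
import math


def f1(e, p1):
    """F_1 as a function of the error profile e = (FN rate, FP rate)."""
    return 2 * (p1 - e[0]) / (2 * p1 - e[0] + e[1])


def dot(a, e):
    """Inner product of two 2-dimensional vectors."""
    return a[0] * e[0] + a[1] * e[1]


# Illustrative values only.
P1 = 0.3
profiles = [(0.00, 0.70), (0.05, 0.30), (0.10, 0.10), (0.20, 0.02), (0.30, 0.00)]

f_star = max(f1(e, P1) for e in profiles)
a_star = (2 - f_star, f_star)   # exact costs a(F*) for beta = 1
a_used = (1.34, 0.66)           # approximate costs actually used

eps0 = math.dist(a_used, a_star)                # condition (i)
eps1 = 0.0                                      # exact minimiser in condition (ii)
m = max(math.hypot(*e) for e in profiles)       # M = max ||e'||_2
phi = 1.0 / P1                                  # Phi = 1 / (beta^2 P_1), beta = 1

e_chosen = min(profiles, key=lambda e: dot(a_used, e))
lower_bound = f_star - phi * (2 * eps0 * m + eps1)
```

Here `f1(e_chosen, P1)` is at least `lower_bound`; the bound is loose on such a small set, but it degrades gracefully as `a_used` moves away from `a_star`.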


Proposition 6 $F_\beta$-measures defined in Section 3.2.1 satisfy the conditions of Proposition 5
with:

(binary) $F_\beta$: $\Phi = \frac{1}{\beta^2 P_1}$ and $a : t \in [0, 1] \mapsto (1 + \beta^2 - t, \; t, \; 0, \; 0)$.


Proof Since $F_\beta$ is linear-fractional as a function of the error profile, it is pseudo-linear on
the open convex set $\{e \in \mathbb{R}^d \mid (1 + \beta^2)P_1 - e_1 + e_2 > 0\}$ (i.e. when the denominator is strictly
positive). Moreover, for every set of classifiers $\mathcal{H}$, we have
$\mathcal{E}(\mathcal{H}) \subset \mathcal{D}_0 = [0, P_1] \times [0, 1 - P_1] \times [0, 1 - P_1] \times [0, P_1]$.

Now, by the definition of $F_\beta$, we have

$$\forall e \in \mathcal{D}_0, \quad F_\beta(e) \le t \;\Leftrightarrow\; (1 + \beta^2 - t)e_1 + t e_2 + (1 + \beta^2)P_1(t - 1) \ge 0 ,$$

and the equivalence still holds when both inequalities are reversed. We thus have that $a(t) =
(1 + \beta^2 - t, t, 0, 0)$ satisfies the condition of Theorem 1 (with $b(t) = (1 + \beta^2)P_1(t - 1)$).

We now show that the condition of Equation (1) is satisfied for $a(t) = (1 + \beta^2 - t, t, 0, 0)$
and all $e, e' \in \mathcal{D}_0$ by taking $\Phi = \frac{1}{\beta^2 P_1}$. To that end, let $e$ and $e'$ in $\mathcal{E}(\mathcal{H})$ and $t$ and $t'$ in
$\mathbb{R}$ such that $t' = F_\beta(e') > F_\beta(e) = t$. Denote by $\varepsilon$ the quantity $\langle a(t'), e - e' \rangle$. Note that
$\varepsilon > 0$ and that:

$$0 = \langle a(t), e \rangle + b(t) = (1 + \beta^2 - t)e_1 + t e_2 + (1 + \beta^2)P_1(t - 1)$$
$$0 = \langle a(t'), e' \rangle + b(t') = (1 + \beta^2 - t')e'_1 + t' e'_2 + (1 + \beta^2)P_1(t' - 1)$$
$$\varepsilon = \langle a(t'), e - e' \rangle = (1 + \beta^2 - t')e_1 + t' e_2 + (1 + \beta^2)P_1(t' - 1)$$

where the first two equalities are given by the definition of the hyperplanes corresponding to
$F_\beta(e) = t$ and $F_\beta(e') = t'$, and the last one is obtained from the definition of $\varepsilon$ together
with the second equality. Taking the difference of the third and first equalities, we obtain:

$$\varepsilon = (t - t')e_1 + (t' - t)e_2 + (1 + \beta^2)P_1(t' - t) .$$

From which we get, since $(1 + \beta^2)P_1 - e_1 + e_2 > 0$ for $e \in \mathcal{D}_0$:

$$F_\beta(e') - F_\beta(e) = t' - t = \varepsilon \left((1 + \beta^2)P_1 - e_1 + e_2\right)^{-1} \le \frac{\varepsilon}{\beta^2 P_1} ,$$

because $\beta^2 P_1$ is the minimum of $(1 + \beta^2)P_1 - e_1 + e_2$ on $\mathcal{D}_0$ (taking $e_1 = P_1$ and $e_2 = 0$). We
obtain the result since $\varepsilon = \langle a(t'), e - e' \rangle$ by definition. □
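
The inequality established in this proof can be checked numerically. The sketch below samples random pairs of error profiles (restricted to the two coordinates that carry non-zero cost) and verifies that $F_\beta(e') - F_\beta(e) \le \Phi \langle a(F_\beta(e')), e - e' \rangle$ whenever $F_\beta(e') > F_\beta(e)$. The function names, the sampling scheme and the tolerance are assumptions of this illustration.

```python
import random


def f_beta(e1, e2, p1, beta):
    """F_beta for false negative rate e1 and false positive rate e2."""
    b2 = beta ** 2
    return (1 + b2) * (p1 - e1) / ((1 + b2) * p1 - e1 + e2)


def gap_and_bound(e, e_prime, p1, beta):
    """Return (F(e') - F(e), Phi * <a(F(e')), e - e'>)."""
    b2 = beta ** 2
    t_prime = f_beta(e_prime[0], e_prime[1], p1, beta)
    a = (1 + b2 - t_prime, t_prime)
    eps = a[0] * (e[0] - e_prime[0]) + a[1] * (e[1] - e_prime[1])
    phi = 1.0 / (b2 * p1)
    return t_prime - f_beta(e[0], e[1], p1, beta), phi * eps


def check(num_trials=1000, p1=0.3, beta=1.0, seed=0):
    """True if no sampled pair violates the inequality (up to rounding)."""
    rng = random.Random(seed)
    for _ in range(num_trials):
        e = (rng.uniform(0, p1), rng.uniform(0, 1 - p1))
        e_prime = (rng.uniform(0, p1), rng.uniform(0, 1 - p1))
        gap, bound = gap_and_bound(e, e_prime, p1, beta)
        if gap > 0 and gap > bound + 1e-12:
            return False
    return True
```

Calling `check()` with other values of `p1` and `beta` exercises the same inequality for different class proportions and weightings of recall.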

Corollary 7 For the $F_1$-measure, the optimal classifier is the solution to the cost-sensitive
binary classification problem with costs $\left(1 - \frac{F^*}{2}, \frac{F^*}{2}\right)$.

Proof From Proposition 4, by putting $\beta = 1$, we have

$$(2 - F^*)e_1 + F^* e_2 + 2P_1(F^* - 1) \ge 0 .$$

Dividing by 2, we get

$$\left(1 - \frac{F^*}{2}\right)e_1 + \frac{F^*}{2} e_2 + P_1(F^* - 1) \ge 0 .$$

The cost vector $a(t)$, according to Theorem 1, is $\left(1 - \frac{F^*}{2}, \frac{F^*}{2}\right)$. □
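
Since $F^*$ is not known in advance, the costs of this corollary are in practice evaluated over candidate values of $F^*$. A minimal sketch follows, in which the helper names and the evenly spaced grid are assumptions of this illustration:

```python
def f1_costs(f_star):
    """Costs (false negative cost, false positive cost) for a target F_1 value."""
    return (1 - f_star / 2, f_star / 2)


def candidate_costs(num_candidates):
    """Cost pairs for evenly spaced candidate values of F* in (0, 1]."""
    return [f1_costs((i + 1) / num_candidates) for i in range(num_candidates)]
```

For instance, `f1_costs(0.5)` gives a false negative cost of 0.75 and a false positive cost of 0.25, and the two costs coincide only when the candidate value is 1.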

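Proposition 8 Multilabel micro-F ($mF_\beta$) measures defined in Section 3.2.2 satisfy the
conditions of Proposition 5 with:

(multilabel-micro) $mF_\beta$: $\Phi = \frac{1}{\beta^2 \sum_{k=1}^{L} P_k}$ and
$a_i(t) = \begin{cases} 1 + \beta^2 - t & \text{if } i \text{ is odd} \\ t & \text{if } i \text{ is even} \end{cases}$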

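Proof

$$mF_\beta(e) \le t \;\Leftrightarrow\;
\frac{(1 + \beta^2) \sum_{k=1}^{L} (P_k - e_{2k-1})}{(1 + \beta^2) \sum_{k=1}^{L} P_k + \sum_{k=1}^{L} (e_{2k} - e_{2k-1})} \le t$$
$$\Leftrightarrow\; (1 + \beta^2 - t) \sum_{k=1}^{L} e_{2k-1} + t \sum_{k=1}^{L} e_{2k} + (1 + \beta^2)(t - 1) \sum_{k=1}^{L} P_k \ge 0$$

Thus, we have that

$$a_i(t) = \begin{cases} 1 + \beta^2 - t & \text{if } i \text{ is odd} \\ t & \text{if } i \text{ is even} \end{cases}$$

Following the same arguments as in Proposition 4, with $t' = mF_\beta(e') > mF_\beta(e) = t$ and
$\varepsilon = \langle a(t'), e - e' \rangle$, we get

$$mF_\beta(e') - mF_\beta(e) = t' - t
= \varepsilon \left( (1 + \beta^2) \sum_{k=1}^{L} P_k - \sum_{k=1}^{L} e_{2k-1} + \sum_{k=1}^{L} e_{2k} \right)^{-1}
\le \frac{\varepsilon}{\beta^2 \sum_{k=1}^{L} P_k} ,$$

because $\beta^2 \sum_{k=1}^{L} P_k$ is the minimum of
$(1 + \beta^2) \sum_{k=1}^{L} P_k - \sum_{k=1}^{L} e_{2k-1} + \sum_{k=1}^{L} e_{2k}$ in the
respective domain (taking $e_{2k-1} = P_k$ and $e_{2k} = 0$). We obtain the result since $\varepsilon =
\langle a(t'), e - e' \rangle$ by definition. □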
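
The equivalence used at the start of this proof can be illustrated numerically. The sketch below stores the error profile as a flat list $[e_1, e_2, \ldots, e_{2L}]$, so that Python indices 0, 2, ... hold the odd-numbered entries (false negatives) of the text. The function names and the numbers in the usage note are assumptions of this illustration.

```python
def micro_f_beta(e, p, beta=1.0):
    """Micro-F_beta for e = [FN_1, FP_1, ..., FN_L, FP_L] and p = [P_1, ..., P_L]."""
    b2 = beta ** 2
    fn = sum(e[0::2])  # entries e_1, e_3, ... in the 1-based notation of the text
    fp = sum(e[1::2])  # entries e_2, e_4, ...
    return (1 + b2) * (sum(p) - fn) / ((1 + b2) * sum(p) - fn + fp)


def micro_cost_vector(t, num_labels, beta=1.0):
    """Cost vector a(t): 1 + beta^2 - t on odd entries, t on even entries."""
    return [1 + beta ** 2 - t, t] * num_labels


def affine_value(e, p, t, beta=1.0):
    """<a(t), e> + b(t), non-negative exactly when micro-F_beta(e) <= t."""
    a = micro_cost_vector(t, len(p), beta)
    b = (1 + beta ** 2) * (t - 1) * sum(p)
    return sum(ai * ei for ai, ei in zip(a, e)) + b
```

With `p = [0.2, 0.5]` and `e = [0.05, 0.1, 0.1, 0.2]`, `affine_value` vanishes at `t = micro_f_beta(e, p)`, is positive for larger `t` and negative for smaller `t`.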


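Proposition 9 Multiclass micro-F ($mcF_\beta$) measures defined in Section 3.2.3 satisfy the conditions
of Proposition 5 with:

(multiclass-micro) $mcF_\beta$: $\Phi = \frac{1}{\beta^2 (1 - P_1)}$ and
$a_i(t) = \begin{cases} 1 + \beta^2 - t & \text{if } i \text{ is odd and } i \ne 1 \\ t & \text{if } i = 1 \\ 0 & \text{otherwise} \end{cases}$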


Proof

$$mcF_\beta(e) \le t \;\Leftrightarrow\;
\frac{(1 + \beta^2)\left(1 - P_1 - \sum_{k=2}^{L} e_{2k-1}\right)}{(1 + \beta^2)(1 - P_1) - \sum_{k=2}^{L} e_{2k-1} + e_1} \le t$$
$$\Leftrightarrow\; (1 + \beta^2 - t) \sum_{k=2}^{L} e_{2k-1} + t e_1 + (1 + \beta^2)(t - 1)(1 - P_1) \ge 0$$

Thus, we have that

$$a_i(t) = \begin{cases} 1 + \beta^2 - t & \text{if } i \text{ is odd and } i \ne 1 \\ t & \text{if } i = 1 \\ 0 & \text{otherwise} \end{cases}$$

Following the same arguments as in Proposition 4, with $t' = mcF_\beta(e') > mcF_\beta(e) = t$ and
$\varepsilon = \langle a(t'), e - e' \rangle$, we get

$$mcF_\beta(e') - mcF_\beta(e) = t' - t
= \varepsilon \left( (1 + \beta^2)(1 - P_1) - \sum_{k=2}^{L} e_{2k-1} + e_1 \right)^{-1}
\le \frac{\varepsilon}{\beta^2 (1 - P_1)} ,$$

because $\beta^2 (1 - P_1)$ is the minimum of $(1 + \beta^2)(1 - P_1) - \sum_{k=2}^{L} e_{2k-1} + e_1$ in the respective domain
(taking $\sum_{k=2}^{L} e_{2k-1} = 1 - P_1$ and $e_1 = 0$). We obtain the result since $\varepsilon = \langle a(t'), e - e' \rangle$ by
definition. □
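
The multiclass cost vector differs from the multilabel one only in the treatment of class 1, which the following sketch makes explicit. It follows the formula of this proof literally; the function names, the flat list layout of the error profile and the numbers in the usage note are assumptions of this illustration.

```python
def multiclass_micro_f_beta(e, p1, beta=1.0):
    """Multiclass micro-F_beta for e = [e_1, e_2, ..., e_2L] and class-1 proportion p1."""
    b2 = beta ** 2
    fn_others = sum(e[2::2])  # e_3, e_5, ... in the 1-based notation of the text
    return (1 + b2) * (1 - p1 - fn_others) / ((1 + b2) * (1 - p1) - fn_others + e[0])


def multiclass_cost_vector(t, num_classes, beta=1.0):
    """Cost vector a(t): t on e_1, 1 + beta^2 - t on e_3, e_5, ..., 0 elsewhere."""
    a = [0.0] * (2 * num_classes)
    a[0] = t
    for k in range(1, num_classes):
        a[2 * k] = 1 + beta ** 2 - t
    return a


def affine_value(e, p1, t, beta=1.0):
    """<a(t), e> + b(t), non-negative exactly when the measure is at most t."""
    a = multiclass_cost_vector(t, len(e) // 2, beta)
    b = (1 + beta ** 2) * (t - 1) * (1 - p1)
    return sum(ai * ei for ai, ei in zip(a, e)) + b
```

For three classes with `p1 = 0.4` and `e = [0.1, 0.0, 0.05, 0.0, 0.1, 0.0]`, `affine_value` is zero at `t = multiclass_micro_f_beta(e, 0.4)`, which is the hyperplane characterization used in the proof.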

